Julien Rodriguez
UNIVERSITÉ DE BORDEAUX
DOCTORAL SCHOOL OF MATHEMATICS AND COMPUTER SCIENCE
by Julien Rodriguez
TO OBTAIN THE DEGREE OF
DOCTOR
SPECIALTY: COMPUTER SCIENCE
Title: Circuit partitioning for multi-FPGA platforms
Contents
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
Introduction 7
Digital electronic circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Field-Programmable Gate Arrays . . . . . . . . . . . . . . . . . . . . . . 8
Circuit partitioning for rapid prototyping . . . . . . . . . . . . . . . . . . 9
Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1 Definitions 13
1.1 Graphs and Hypergraphs . . . . . . . . . . . . . . . . . . . . . . . . 14
1.1.1 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.1.2 Hypergraphs . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.1.3 Red-Black Hypergraph . . . . . . . . . . . . . . . . . . . . . 20
1.2 Criticality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.2.1 Circuit cell criticality . . . . . . . . . . . . . . . . . . . . . . 24
1.2.2 Vertex and path criticality . . . . . . . . . . . . . . . . . . . 25
1.3 Partitioning and Clustering . . . . . . . . . . . . . . . . . . . . . . 26
1.3.1 Partitions and cut metrics . . . . . . . . . . . . . . . . . . . 26
1.3.2 Problem statement . . . . . . . . . . . . . . . . . . . . . . . 27
1.3.3 Graph and hypergraph circuit models . . . . . . . . . . . . . 29
1.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
References 317
Publications 339
Conferences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
List of Figures
List of Tables
A.1 Path cost results of clustering algorithms with a maximum size equal to 2. . . . 187
A.2 Path cost results of clustering algorithms with a maximum size equal to 4. . . . 188
A.3 Path cost results of clustering algorithms with a maximum size equal to 8. . . . 189
A.4 Path cost results of clustering algorithms with a maximum size equal to 16. . . . 190
A.5 Path cost results of clustering algorithms with a maximum size equal to 32. . . . 191
A.6 Path cost results of clustering algorithms with a maximum size equal to 64. . . . 192
A.7 Path cost results of clustering algorithms with a maximum size equal to 128. . . . 193
A.8 Path cost results of clustering algorithms with a maximum size equal to 256. . . . 194
A.9 Path cost results of clustering algorithms with a maximum size equal to 512. . . . 195
A.10 Path cost results of clustering algorithms with a maximum size equal to 1024. . . . 196
A.11 Path cost results of clustering algorithms with a maximum size equal to 2048. . . . 197
A.12 Path cost results of clustering algorithms with a maximum size equal to 4096. . . . 198
A.13 Results for critical path: dΠmax(H), on target T1 for circuits in ITC set. . . . 200
A.14 Results for critical path: dΠmax(H), on target T2 for circuits in ITC set. . . . 201
A.15 Results for critical path: dΠmax(H), on target T3 for circuits in ITC set. . . . 202
A.16 Results for critical path: dΠmax(H), on target T4 for circuits in ITC set. . . . 203
A.17 Results for critical path: dΠmax(H), on target T5 for circuits in ITC set. . . . 204
A.18 Results for critical path: dΠmax(H), on target T6 for circuits in ITC set. . . . 205
A.19 Results for critical path: dΠmax(H), on target T1 for circuits in Chipyard and Titan sets. . . . 206
A.20 Results for critical path: dΠmax(H), on target T2 for circuits in Chipyard and Titan sets. . . . 207
A.21 Results for critical path: dΠmax(H), on target T3 for circuits in Chipyard and Titan sets. . . . 208
A.22 Results for critical path: dΠmax(H), on target T4 for circuits in Chipyard and Titan sets. . . . 209
A.23 Results for critical path: dΠmax(H), on target T5 for circuits in Chipyard and Titan sets. . . . 210
A.24 Results for critical path: dΠmax(H), on target T6 for circuits in Chipyard and Titan sets. . . . 211
A.25 Results for connectivity cost: fλ(HΠ), on target T3 for circuits in ITC set. . . . 213
A.26 Results for connectivity cost: fλ(HΠ), on target T6 for circuits in ITC set. . . . 214
Extended Summary
An FPGA ("Field-Programmable Gate Array") is an integrated circuit comprising a large number of programmable, interconnectable logic resources. These resources make it possible to implement, by programming, a digital electronic circuit such as a microprocessor, a compute accelerator, or a complex hybrid system-on-chip. FPGAs are widely used in the field of integrated circuit design, because they yield prototype circuits very quickly, without having to fabricate the chip in silicon. However, some circuits are too large to fit onto a single FPGA. To solve this problem, one can use a platform composed of several tightly interconnected FPGAs, which can be regarded as a single virtual FPGA giving access to all the resources of the platform. This solution, although elegant, raises several problems. In particular, existing tools do not take into account all the constraints of the placement problem that must be solved to map a circuit efficiently onto a multi-FPGA platform. For example, current cost functions are not designed to minimize the signal propagation time between registers, which is nonetheless crucial for the performance of the resulting prototype, nor do they take into account the capacity constraints induced by the routing of connections.
The typical digital electronic hardware design process comprises several steps, including prototyping, verification, placement, and routing, which may involve very large logic circuits. The methods implemented in these steps often take advantage of divide-and-conquer approaches to separate circuits into smaller subcircuits. These subsystems are easier to handle and aim to reduce the work on the global circuit.
To prototype large circuits that cannot fit into a single FPGA, a multi-FPGA platform is required. In this case, the divide-and-conquer approach is used to split circuits into several pieces, one per FPGA. To produce valid partitions, the capacity limits of each FPGA and of the interconnection links must be taken into account. In addition, partitions must avoid increasing the length of the longest path, called the critical path. Indeed, for VLSI circuits, the length of the critical path determines the maximum frequency at which the circuit can operate, and placing long paths across several FPGAs is likely to degrade the critical path, because of the longer traversal time between the two components.
… circuit designers.
Multi-constraint hypergraph partitioning (MCHP) is commonly used to solve the circuit partitioning and prototyping problem. In this context, the vertices of the hypergraph model the pins of the logic components, and the hyperedges of the hypergraph represent the wires that connect them. The balanced graph partitioning problem is NP-hard [125], and admits no constant-factor approximation [11] unless P = NP.
Over the past 30 years, several hypergraph partitioning tools have been developed, such as hMetis and its derivative khMetis, PaToH, and KaHyPar. These tools seek to minimize the cut (min-cut) between the computed parts, a cut that can be measured according to several metrics.
To distribute the vertices among the parts, these tools use a multilevel scheme consisting of three phases: coarsening, initial partitioning, and refinement. The coarsening phase recursively applies a clustering method to transform the problem into a smaller one that retains the same topological properties. An initial partitioning is then computed on the smallest problem. Finally, at each level, the solution from the coarser level is prolonged to the finer level and then refined with a local optimization algorithm. The multilevel scheme reduces computation time compared with a direct partitioning approach: the most expensive partitioning algorithms are applied only to the smallest hypergraphs, while the local optimization algorithms consume fewer resources because they operate only on a small region of the hypergraphs, namely the boundary between the parts already found.
State of the art
Many approaches have been tried to improve circuit partitioning performance. We present here some recent work on circuit partitioning for rapid prototyping that takes performance constraints into account. Much of this work uses existing min-cut partitioning tools as black boxes within more complex algorithms, weighting the vertices and edges of the hypergraphs so as to take into account additional constraints that the partitioner must respect. For example, in [4], the authors present a multi-objective approach based on hMetis. The authors determine a finite set of most critical paths at each partitioning step, using a cost that accounts for three factors: the length of the critical path, the number of times paths of critical length are cut, and the weight of the hyperarcs associated with the critical paths.
In [38], the authors compare a classical method using hMetis for partitioning, followed by a placement algorithm, with a derived approach that performs placement and routing during the partitioning step. The results yield better critical path values than the two-step approach. More recently, in [127], the authors perform pre- and post-processing of the considered hypergraph in order to integrate the critical path minimization objective into the cut size metric, using hMetis as the partitioning tool. However, minimizing the cut size is often not the most relevant objective. Moreover, biasing min-cut cost functions to take path cost minimization into account is insufficient.
Contributions
Our work concerns the computation of balanced hypergraph partitions that minimize the cost of the critical path, in addition to the classical cut minimization objective, which remains relevant to us because it helps reduce the inter-FPGA communication capacity constraints.
The first contribution of this thesis is the definition of a specific representation of digital electronic circuits consisting of a union of directed acyclic hypergraphs (DAHs) [160]. The global hypergraph representing the whole circuit is assumed to be connected; otherwise, its connected components can be handled independently, up to FPGA capacities. The source and sink vertices of each DAH are labeled red, while the other vertices are black. Red vertices typically represent registers and input/output (I/O) ports, and may be shared between several DAHs, which makes the global hypergraph connected. Black vertices represent combinatorial circuits. A path cost function models the impact of a cut on the paths of the DAHs during partitioning. Each partition of a hypergraph will induce cuts along certain paths, incurring an additional traversal cost. Our goal is to find a partition that reduces the maximum path length between two red vertices, which corresponds to the minimum time needed to compute and save the circuit state data into the registers, and that also minimizes the cut size. In our context, we assume that the routing cost between parts may be non-uniform.
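To make this path cost idea concrete, here is a minimal C sketch; the DAG, its unit arc weights, and the penalty value are hypothetical toy data, not taken from this thesis. Each arc counts for 1, and every cut arc, i.e., an arc whose endpoints lie in different parts, incurs an extra traversal penalty.

```c
#include <assert.h>

/* Toy DAG with vertices 0..4 listed in topological order.
 * Arc a goes from arc_u[a] to arc_v[a]. */
#define NV 5
#define NA 5
static const int arc_u[NA] = {0, 0, 1, 2, 3};
static const int arc_v[NA] = {1, 2, 3, 3, 4};

/* Longest path length under partition part[]: each arc costs 1,
 * plus `penalty` when its endpoints are in different parts. */
int longest_path(const int part[NV], int penalty) {
    int dist[NV] = {0};
    int best = 0;
    for (int a = 0; a < NA; a++) {    /* arcs follow topological order */
        int w = 1 + (part[arc_u[a]] != part[arc_v[a]] ? penalty : 0);
        if (dist[arc_u[a]] + w > dist[arc_v[a]])
            dist[arc_v[a]] = dist[arc_u[a]] + w;
        if (dist[arc_v[a]] > best)
            best = dist[arc_v[a]];
    }
    return best;
}
```

With all vertices in one part, the longest path in this toy DAG has 3 arcs; a partition that cuts arcs along it lengthens it, which is exactly what a path-aware partitioner tries to avoid.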
Most current circuit partitioning methods are based on publicly available general-purpose graph partitioning tools.
Perspectives
All our algorithms have been implemented in the C language, within a software framework composed of data structures representing our red-black hypergraph model, along with auxiliary service routines. The partitioning problem addressed in this thesis differs from the one handled by the hMetis, khMetis, KaHyPar, and PaToH tools and, to our knowledge, no public tool is dedicated to our partitioning problem. We therefore decided to formalize our work in the RaiSin software. This tool comprises a partitioning scheme composed of the algorithms presented in this thesis, including an adaptation of our refinement algorithm called DKFM. Our refinement algorithm is topology-aware, but may not be able to fix a topology-oblivious initial partitioning, because a local refinement algorithm is not designed to reconsider global decisions. Consequently, awareness of the target topology should be integrated into the initial partitioning algorithms.
Note also that approximation biases on paths are created during the coarsening phase. Indeed, when vertices are merged together, false paths may be created and taken into account. Consequently, the preservation of combinatorial path consistency should be studied and added to the coarsening algorithms, in order to ease the work of the initial partitioning and refinement algorithms.
Introduction
Digital electronic circuits
… sufficient resources to perform its designated tasks effectively. This approach is particularly valuable when designing large-scale digital systems, such as high-performance
computing applications, where the complexity and computational demands require
the use of multiple FPGAs working together.
The effectiveness of the partitioning strategy will greatly influence the success
of the multi-FPGA prototype.
A formal definition of the problem of circuit partitioning for multi-FPGA platforms can be found in Chapter 1, Section 1.3.
Contributions
The first contribution of this thesis is the definition of the red-black hypergraph
model, which is an extension of hypergraphs dedicated to the representation of
digital electronic circuits. This representation allows us to better model physical
constraints such as how partitioning affects performance, which is one of the main
objectives addressed in this thesis. Based on this abstraction, we design a cost
model and specific algorithms to solve the circuit partitioning problem for multi-
FPGA platforms. Most current circuit partitioning methods are based on publicly
available general-purpose graph partitioning tools. However, current partitioning
tools use a cost model that is not suited for the problem addressed in this thesis.
More details on current methods and their limitations can be found in Chapter 2.
Existing tools take advantage of the multilevel framework, which consists of
three phases: coarsening, initial partitioning and refinement. The coarsening phase
recursively uses a clustering method to transform the circuit model into a smaller
one. During the second phase, an initial partitioning is computed on the resulting
smaller circuit. Then, in the third phase, an algorithm is applied at each recursion
level to prolong the computed circuit partition to the upper level and subsequently
refine it. The multilevel framework reduces computation time compared to a direct
partitioning approach. Computation time is important because partitioning a large
circuit is a difficult and time-consuming problem. This is why this thesis presents
algorithms for each stage of the multilevel framework.
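As a rough illustration only, one level of the three-phase flow described above can be sketched in C as follows; the pairing-based coarsening, greedy initial partitioning, and toy data are hypothetical simplifications, the refinement pass is omitted, and this is not one of the algorithms contributed by this thesis.

```c
#include <assert.h>

#define NF 8            /* number of fine vertices */
#define NC (NF / 2)     /* coarse vertices: one per pair */

/* One coarsen / initial-partition / prolong round of a bisection.
 * wf: fine vertex weights; part: resulting part (0 or 1) per vertex. */
void multilevel_bisect(const int wf[NF], int part[NF]) {
    int wc[NC], pc[NC];
    /* Coarsening: cluster vertices 2i and 2i+1; weights are summed. */
    for (int i = 0; i < NC; i++)
        wc[i] = wf[2 * i] + wf[2 * i + 1];
    /* Initial partitioning: greedily assign to the lighter part. */
    int load[2] = {0, 0};
    for (int i = 0; i < NC; i++) {
        int p = (load[0] <= load[1]) ? 0 : 1;
        pc[i] = p;
        load[p] += wc[i];
    }
    /* Prolongation: each fine vertex inherits its cluster's part. */
    for (int i = 0; i < NF; i++)
        part[i] = pc[i / 2];
}
```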
The initial focus is on clustering algorithms, i.e., algorithms whose goal is to
reduce problem size. These algorithms proceed by merging elements, i.e., two ele-
ments become a single element with a weight equal to the sum of the weights of the
merged elements. For that matter, this thesis proposes a study on the relevance
of the choice of the elements to be merged. Based on this study, an adaptation of
the Heavy Edge Matching (HEM) algorithm is proposed, as well as a clustering
algorithm with variable cluster size, called Binary Search Clustering (BSC). Ex-
perimental results demonstrate that BSC performs better than HEM on our circuit
model. An improved result for the approximation complexity parameterized by cluster size is also presented.
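For reference, the classical Heavy Edge Matching heuristic that the proposed clustering extends can be sketched as below; the weighted edge list is hypothetical toy data, and this plain-graph version ignores the red-black structure introduced in this thesis.

```c
#include <assert.h>

#define NV 4
#define NE 4
/* Toy weighted graph: edge e joins eu[e] and ev[e] with weight ew[e]. */
static const int eu[NE] = {0, 1, 2, 0};
static const int ev[NE] = {1, 2, 3, 3};
static const int ew[NE] = {5, 1, 4, 2};

/* Heavy Edge Matching: visit vertices in index order and pair each
 * unmatched vertex with an unmatched neighbor along the heaviest
 * incident edge.  match[v] is v's partner, or -1 if unmatched. */
void heavy_edge_matching(int match[NV]) {
    for (int v = 0; v < NV; v++) match[v] = -1;
    for (int v = 0; v < NV; v++) {
        if (match[v] != -1) continue;
        int best = -1, bw = -1;
        for (int e = 0; e < NE; e++) {
            int other = (eu[e] == v) ? ev[e] : (ev[e] == v) ? eu[e] : -1;
            if (other >= 0 && match[other] == -1 && ew[e] > bw) {
                bw = ew[e];
                best = other;
            }
        }
        if (best >= 0) { match[v] = best; match[best] = v; }
    }
}
```

Matched pairs are then merged into cluster vertices whose weight is the sum of the weights of the pair, as described above.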
Outline
The following chapters present the main contributions of the thesis. Chapter 1
introduces the notations and definitions used in this dissertation. Our cost model
and hypergraph structure are defined in this chapter. Chapter 2 presents the current state-of-the-art in circuit partitioning. Chapter 3 introduces the experimental
setup and the methodology to evaluate and compare our algorithms to the state-of-
the-art. Chapter 4 introduces a weighting scheme and two clustering algorithms:
an extension of heavy edge matching and the binary search clustering algorithm,
and an approximation algorithm parameterized by cluster size. Chapter 5 presents
greedy partitioning methods based on path algorithms, as well as an integer pro-
gramming approach. Chapter 6 presents a solution refinement algorithm for the
path length-aware partitioning problem.
Finally, a conclusion and perspectives for this work are presented in the last
chapter of this thesis.
Appendix A.1 presents detailed experimental results obtained using the algo-
rithms presented in the previous chapters and implemented in the open source
RaiSin software. I wrote RaiSin during my PhD studies, and it is the testing ground for all of the algorithms presented in this thesis.
Chapter 1
Definitions
This chapter presents notations and definitions concerning graphs and hyper-
graphs, and operations attached to them. In Section 1.1 graphs, hypergraphs and
the red-black hypergraph are defined.
The notion of criticality is presented in Section 1.2, and the problem of parti-
tioning for path-length minimization to a target topology is defined in Section 1.3.
A conclusion of this chapter can be found in Section 1.4.
1.1 Graphs and Hypergraphs
1.1.1 Graphs
Definition 1.1.1. A graph G ≝ (V, E) is defined as a set of vertices V and a set of edges E. An edge is an unordered pair of vertices.
In this thesis, we consider the sets V and E to be finite. Given some graph G,
we denote the set of vertices by V (G) and the set of edges by E(G). The number
of vertices in a graph is called the order of the graph, noted |V (G)|. As V (G) and
E(G) can be weighted, we will denote by WV the set of vertex weights, and WE
the set of the edge weights. The weights can be multivalued.
Definition 1.1.2. A directed graph G ≝ (V, A) is defined as a set of vertices V and a set of arcs A. An arc is an ordered pair of vertices: the first vertex is the source and the second one is the sink.
In this thesis, we consider the sets V and A to be finite. We denote by A(G)
the set of arcs of a directed graph G. As V (G) and A(G) can be weighted, we will denote by WV the set of vertex weights, and by WA the set of arc weights. In
the general case, we consider multivalued arc and vertex weights. Examples of a
graph and a directed graph can be found in Figure 1.1.
An edge or an arc of the form {u, u} is called a loop. An edge that exists in several instances in the edge set E is called a multiple edge; likewise, an arc that exists in several instances in the arc set A is called a multiple arc.
For a vertex u, the set of its neighboring vertices Γ(u) is defined as follows:
Γ(u) ≝ {v | u ≠ v, {u, v} ∈ E} . (1.1)
An edge or an arc {u, v} is said to be incident to vertices u and v. Vertices u and v are the ends of the edge or arc, and are called neighbors.
According to the definition of an arc a = {u, v}, vertex u is considered to be
an incoming neighbor of v, and v is considered to be an outgoing neighbor of u.
Let u be a vertex; the set of its incoming neighbor vertices Γ− (u) is defined as:
Γ− (u) ≝ {v | v ≠ u, {v, u} ∈ A} , (1.2)
and the set of its outgoing neighbor vertices Γ+ (u) is defined as:
Γ+ (u) ≝ {v | v ≠ u, {u, v} ∈ A} . (1.3)
Figure 1.2 shows several examples of loops and multiple edges or arcs. Example
a shows a multiple edge {u, v}, that appears twice. A similar version is shown for
arcs in Example b, in which vertex u is the source of the two arcs {u, v}, with
vertex v as their sink. Example c, however, does not show a multiple arc because
the orientation is different: {u, v} and {v, u}. Finally, examples d and e are loops
of the form {u, u} for edges and arcs.
All graphs considered in this work are graphs without loops or multiple edges.
This type of graph is called a simple graph.
Let v be a vertex of a graph G. The degree of vertex v, denoted by δ(v), is
the number of edges (resp. arcs) connected to v, that is, δ(v) = |Γ(v)|. In the
directed case, the indegree δ − (v) is the number of arcs of A(G) of which v is the
sink, that is, δ − (v) = |Γ− (v)|. The outdegree δ + (v) is the number of arcs of A(G)
whose source is v, that is, δ + (v) = |Γ+ (v)|. This notion is extended to the entire
graph as follows: the minimal degree of G, δ(G), is the minimum over all vertices
of V (G) of δ(v):
δ(G) ≝ min{δ(v) | v ∈ V (G)} . (1.4)
Similarly, the maximum degree of G, denoted ∆(G), is the maximum of δ(v)
over all vertices v of V (G):
∆(G) ≝ max{δ(v) | v ∈ V (G)} . (1.5)
A path between two vertices u and v of a graph G is a sequence of edges (resp.
arcs), {{v0 , v1 }, {v1 , v2 }, ..., {vk−1 , vk }} in E(G) (resp. A(G)) such that u = v0 and
v = vk . The set of vertices in a path p is noted by V (p). The number of vertices in
a path p is defined as |V (p)|. Let P (G) be the set of paths of a graph G. The length of a path p is commonly defined as the sum of the weights of the edges (resp. arcs) in p. In the unweighted case, the length corresponds to the number of edges (resp. arcs). From the definition of a path, we can define a cycle as a path in which the
start and end vertices are identical.
A graph G is said to be connected if there exists a path between any pair of
vertices in the graph. In a graph G, a shortest path between two vertices u and v is
a path of minimal length between u and v. Therefore, the distance between u and
v is the length of a shortest path between them. From the previous definitions, we
define the diameter of a connected graph G, denoted diam(G), as the maximum, over all pairs of vertices u and v of V (G), of the distance between u and v. Let u be
a vertex of a connected graph G; u is said to be peripheral if there exists a vertex
v in G such that the distance between u and v is equal to the diameter of G. For
more details on the definitions of graphs, please refer to [21].
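These notions of distance, diameter, and peripheral vertices can be computed by breadth-first search; here is a minimal C sketch on a hypothetical toy graph.

```c
#include <assert.h>

#define NV 5
#define NE 5
/* Toy connected graph: edges {0,1}, {0,2}, {1,3}, {2,3}, {3,4}. */
static const int beu[NE] = {0, 0, 1, 2, 3};
static const int bev[NE] = {1, 2, 3, 3, 4};

/* Unweighted distance from src to dst via breadth-first search;
 * returns -1 if dst is unreachable. */
int bfs_distance(int src, int dst) {
    int dist[NV], queue[NV], head = 0, tail = 0;
    for (int i = 0; i < NV; i++) dist[i] = -1;
    dist[src] = 0;
    queue[tail++] = src;
    while (head < tail) {
        int u = queue[head++];
        for (int e = 0; e < NE; e++) {
            int w = (beu[e] == u) ? bev[e] : (bev[e] == u) ? beu[e] : -1;
            if (w >= 0 && dist[w] == -1) {
                dist[w] = dist[u] + 1;
                queue[tail++] = w;
            }
        }
    }
    return dist[dst];
}

/* diam(G): maximum distance over all vertex pairs. */
int diameter(void) {
    int d = 0;
    for (int u = 0; u < NV; u++)
        for (int v = u + 1; v < NV; v++)
            if (bfs_distance(u, v) > d) d = bfs_distance(u, v);
    return d;
}
```

In this toy graph, the distance between vertices 0 and 4 equals the diameter, so both are peripheral.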
1.1.2 Hypergraphs
First introduced by Claude Berge [21], hypergraphs (resp., directed hypergraphs)
are a generalization of graphs (resp. directed graphs) in which the notion of edge
(resp. arc) is extended to that of hyperedge (resp. hyperarc), which can connect one
or more vertices (resp., one or more source vertices to one or more sink vertices).
In the general case, a hyperarc can have a source equal to its sink.
Definition 1.1.3. A hypergraph H ≝ (V, E) is defined as a set of vertices V and a set of hyperedges E. A hyperedge is an unordered subset of two or more vertices.
In this thesis, the sets V and E are considered to be finite. To denote the sets of vertices and hyperedges of a hypergraph H, we use the notations V (H) and E(H), similar to those defined for graphs. We denote by WV the set of vertex
weights, and by WE the set of hyperedge weights. The weights can be multivalued.
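In practice, such a hypergraph can be stored compactly; the flat pin-array layout below is a hypothetical sketch in the spirit of common partitioning tools, not a structure defined in this chapter.

```c
#include <assert.h>

/* Hypergraph with NV vertices and NHE hyperedges stored as a flat
 * pin array: hyperedge e spans pins[xadj[e]] .. pins[xadj[e+1]-1].
 * Toy instance: e0 = {0, 1, 2}, e1 = {3, 4}. */
#define NV 5
#define NHE 2
static const int xadj[NHE + 1] = {0, 3, 5};
static const int pins[5] = {0, 1, 2, 3, 4};

/* Number of vertices in hyperedge e. */
int hyperedge_size(int e) { return xadj[e + 1] - xadj[e]; }

/* i-th vertex of hyperedge e. */
int pin(int e, int i)     { return pins[xadj[e] + i]; }
```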
Definition 1.1.4. A directed hypergraph H ≝ (V, A) is defined as a set of vertices V and a set of hyperarcs A. A hyperarc a is a pair of vertex subsets, a = {s− (a), s+ (a)}, with s− (a) ⊂ V the set of sources of the hyperarc and s+ (a) ⊂ V the set of its sinks.
In this thesis, the sets V and A are considered to be finite. A(H) denotes the
set of hyperarcs of a directed hypergraph H. We denote by WV the set of vertex
weights, and by WA the set of hyperarc weights. Weights can be multivalued.
Figure 1.3 shows examples of hyperedges and hyperarcs.
Figure 1.4 shows several examples of loops and multiple hyperedges and hyper-
arcs. Example a shows a multiple hyperedge in which e0 = {v0 , v1 , v2 , v3 } = e1 =
{v0 , v1 , v2 , v3 }. A similar version is shown for hyperarcs in Example b, in which
vertices v0 and v2 are the sources of the two hyperarcs {{v0 , v2 }, {v1 , v3 }} with
vertices v1 and v3 being their sinks. However, in Example c, there is no multiple
hyperarc because the orientation is different. The sources s− (a0 ) = {v0 , v2 } are
different from s− (a1 ) = {v1 , v3 }. Finally, Examples d and e are loops of the form
{v0 } for hyperedges and {{v0 }, {v0 }} for hyperarcs.
As with the graphs defined in the previous subsection, we will not consider
hyperloops, multiple hyperedges and multiple hyperarcs.
The set of neighbors can be extended to the hyperedges and the hyperarcs as
follows:
Γ(u) ≝ {v | ∃e ∈ E(H) s.t. u, v ∈ e, v ≠ u} . (1.6)
In the case of directed hypergraphs, we have to consider the pairs of source and
sink vertices in hyperarcs. Hence, the incoming neighbor set of vertex u is defined
as:
Γ− (u) ≝ {v | v ≠ u, ∃a ∈ A(H), u ∈ s+ (a), v ∈ s− (a)} . (1.7)
Let u be a vertex; the set of its outgoing neighbor vertices Γ+ (u) is defined as:
Γ+ (u) ≝ {v | v ≠ u, ∃a ∈ A(H), v ∈ s+ (a), u ∈ s− (a)} . (1.8)
As multiple edges and loops are not considered in this work, we assume that E ′ and A′ do not contain any of them, that is, the graph G that models the neighbor relations is a simple graph.
Figure 1.5: An example of a hyperedge and a hyperarc and their respective associ-
ated graph representation. Hyperedge e connects all the vertices and the associated
subgraph is a clique connecting all vertices in e. In the example below, which concerns a hyperarc a, the Cartesian product is obtained between the sources of a and its sinks. We only keep the edges that connect the sources to the sinks.
A path between two vertices u and v of a hypergraph is a sequence of hyperedges {e0 , e1 , ..., ek } such that u ∈ e0 and v ∈ ek and ∀i ∈ {1, ..., k}, ∃vi ∈ ei−1 ∩ ei . In the directed
case, u must be a source of e0 , i.e., u ∈ s− (e0 ) and v must be a sink of ek , i.e.,
v ∈ s+ (ek ) and ∀i ∈ {1, ..., k}, ∃vi ∈ s+ (ei−1 ) ∩ s− (ei ). In other words, the path
set of a hypergraph is defined by the path set of its 2-section. We denote by P (H) the set of paths of a hypergraph H. Let graph G be the 2-section of H; we
assume that P (H) = P (G). The set of vertices in a path p is defined by V (p).
The number of vertices in a path is defined as |V (p)|.
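The 2-section construction underlying this definition can be sketched in C as follows (the helper names are hypothetical): a hyperedge expands into a clique on its vertices, and a hyperarc into the Cartesian product of its sources and sinks.

```c
#include <assert.h>

/* 2-section of a hyperedge: a clique on its vertices.  Writes the
 * resulting edges into (outu, outv) and returns their number. */
int clique_edges(const int *verts, int n, int outu[], int outv[]) {
    int k = 0;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {
            outu[k] = verts[i];
            outv[k] = verts[j];
            k++;
        }
    return k;
}

/* 2-section of a hyperarc: the Cartesian product sources x sinks. */
int product_edges(const int *src, int ns, const int *snk, int nk,
                  int outu[], int outv[]) {
    int k = 0;
    for (int i = 0; i < ns; i++)
        for (int j = 0; j < nk; j++) {
            outu[k] = src[i];
            outv[k] = snk[j];
            k++;
        }
    return k;
}
```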
In a directed graph or hypergraph, if a path exists between two vertices u and
v, then v is said to be reachable from u.
A hypergraph H is said to be connected if there exists a path between any pair
of vertices in the hypergraph.
A topological sort or a topological order, is a strict total order of the vertices of
a directed graph or hypergraph, such that the index of any vertex v is higher than
those of all its in-neighbors. A topological sort of the vertices exists if and only if
the directed graph or hypergraph has no cycles, that is, if it is a directed acyclic
graph (DAG) or hypergraph (DAH). A topological sort allows one to traverse an
acyclic dependency graph or hypergraph so that a vertex is traversed only after
all its dependencies have been traversed.
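A topological sort can be computed with Kahn's algorithm, repeatedly emitting a vertex all of whose in-neighbors have already been emitted; here is a minimal C sketch on a hypothetical toy DAG.

```c
#include <assert.h>

#define NV 5
#define NA 5
/* Toy DAG: arcs (0,1), (0,2), (1,3), (2,3), (3,4). */
static const int tau[NA] = {0, 0, 1, 2, 3};
static const int tav[NA] = {1, 2, 3, 3, 4};

/* Kahn's algorithm: repeatedly output a vertex of indegree 0.
 * Returns 1 on success (acyclic), 0 if a cycle prevents a full order. */
int topo_sort(int order[NV]) {
    int indeg[NV] = {0}, done[NV] = {0}, n = 0;
    for (int a = 0; a < NA; a++) indeg[tav[a]]++;
    while (n < NV) {
        int v = -1;
        for (int i = 0; i < NV; i++)
            if (!done[i] && indeg[i] == 0) { v = i; break; }
        if (v < 0) return 0;               /* cycle detected */
        done[v] = 1;
        order[n++] = v;
        for (int a = 0; a < NA; a++)       /* remove v's outgoing arcs */
            if (tau[a] == v) indeg[tav[a]]--;
    }
    return 1;
}
```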
For more details on the definitions related to hypergraphs, please refer to [21].
In the above, we have introduced the hypergraph structures that exist in the literature. In this thesis, we are going to use these structures to model circuits, in
order to partition them. However, these structures have some limitations, which
we are going to explain in more detail in the following subsection.
A register cell stores its input logic state at the end of the clock period for use as an input value in another part of the circuit for the next clock period.
The other main type of cell is the combinatorial cell. Each combinatorial cell
performs a logical operation from its input logic states and produces one or more
logic states as output. Examples of combinatorial cells can vary from a simple
logic gate to a complex arithmetic compute block. Each combinatorial cell has a
processing time associated with it, which depends on the operations performed.
The difference between these two types of cells is that a register updates its state
synchronously at each clock tick, and a combinatorial cell propagates a logic state
immediately after that logic state has been received and processed.
Some works use a netlist to model a digital electronic circuit. A netlist is
a list of nets (or wires) and cells, with possible additional data of the represented
digital electronic circuit. As a model of digital electronic circuit, a netlist can
also be represented by a hypergraph. Consequently, we will use the terms digital electronic circuit, netlist, and hypergraph interchangeably to denote the same object, in the sense of a list of vertices connected by hyperedges.
Red-black hypergraphs are a subclass of directed hypergraphs in which each
hyperarc contains exactly one source vertex, i.e., |s⁻(a)| = 1, ∀a ∈ A, so that
each signal has a single source cell. Moreover, each vertex is assigned a color,
red or black. Red vertices (resp., black vertices) form a subset V^R ⊂ V
(resp., V^B ⊂ V) of the vertex set V, such that V^R ∩ V^B = ∅. Red vertices
model the register cells, and black vertices model the combinatorial cells. By
definition, for u ∈ V(H):
• if Γ+ (u) = ∅, then u ∈ V R . u is a global input (source) of the circuit;
Figure 1.6: The circuit in a) consists of two inputs, two outputs, and 7 combina-
torial cells. The red-black hypergraph b) models the circuit above. It consists of
4 red vertices corresponding to the sources and sinks of the circuit, and 7 black
vertices corresponding to the combinatorial cells.
combinatorial blocks constrain the clock delay because the computed logic state
values must stabilize at the output registers of each combinatorial block before
the next clock tick. If this constraint is not satisfied, the circuit will behave in an
unexpected manner.
In this thesis, we focus on minimizing the length of the critical path of com-
binatorial circuits. Since we are only interested in minimizing the propagation
times of combinatorial logic between two registers, it makes no sense to consider
the length of paths that comprise any additional register different from the start
and end registers. The red-black hypergraph model enables the extraction of all
combinatorial paths, by considering only paths whose start and end vertices are
red.
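The extraction of red-red paths can be sketched as follows, assuming a simplified encoding in which each hyperarc is flattened into source-to-sink successor links and the colors are given as a set of red vertices (both encodings are assumptions made for illustration):

```python
def red_red_paths(succ, red):
    """Enumerate all red-red paths: paths whose start and end vertices are red
    and whose interior vertices are all black.  `succ` maps each vertex to its
    out-neighbors; `red` is the set of red vertices.  Assumes every cycle
    contains a red vertex, so the walk through black interiors terminates."""
    paths = []
    def extend(path):
        for w in succ.get(path[-1], []):
            if w in red:
                paths.append(path + [w])   # reached an end register: stop here
            else:
                extend(path + [w])         # keep walking through black cells
    for r in red:
        extend([r])
    return paths
```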
Since we work exclusively on synchronous circuits designed for implementation
on FPGAs, we are guaranteed that there are no combinatorial loops, i.e., cyclic
paths made only of combinatorial cells, which would yield infinite combinatorial
paths. In other words, every cycle in a red-black hypergraph contains at least
one red vertex. As a result, a combinatorial block, such as the one shown in
Figure 1.6,
Figure 1.7: Circuit composed of two combinatorial blocks. The OIx cells are both
outputs of block A and inputs of block B.
with v_0, v_{n−1} ∈ V^R(h) and v_i ∈ V^B(h), i ∈ {1, ..., n − 2}. Let
H = {h_0, ..., h_{k−1}} be a red-black hypergraph composed of k DAHs; the set
of red-red paths P^R(H) is defined as:

P^R(H) = ⋃_{h_i ∈ H} P^R(h_i) .   (1.14)

The length d(p) of a path p is defined as:

d(p) = Σ_{u ∈ V(p)} d(u) + Σ_{a ∈ p, u ∈ s⁻(a), v ∈ s⁺(a)} d(u, v) .   (1.15)
In the remainder of this document, unless otherwise specified, for any path p
we will use the terms path length or path cost to refer to the quantity d(p).
By extension, d_max(u, v) can be defined as the value of the longest path
between vertices u and v:

d_max(u, v) = max{d(p), p ∈ P_{u,v}} ,   (1.16)

where P_{u,v} is the set of paths between u and v. Furthermore, in circuit
prototyping on a multi-FPGA platform, the routing process is generally
performed after the partitioning. This thesis focuses only on the partitioning
step. As a result, there is no specific cost associated with a connected pair
of vertices because, at this step, there is no routing cost (arc weight).
Unless otherwise specified, we will assume that the weight of each arc is zero
when computing the length of a path d(p).
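Under these conventions, computing d(p) reduces to summing vertex delays, with arc weights defaulting to zero; a minimal sketch follows (the dictionary encodings are illustrative assumptions, not the thesis implementation):

```python
def path_length(path, d_vertex, d_arc=None):
    """Compute d(p): the sum of the delays of the vertices of p, plus the
    weight of each traversed arc.  As stated above, arc weights default to
    zero at the partitioning step, before routing."""
    total = sum(d_vertex[v] for v in path)
    if d_arc is not None:
        total += sum(d_arc.get((u, v), 0) for u, v in zip(path, path[1:]))
    return total

def d_max(paths, d_vertex):
    """d_max over a set of paths: the length of the longest one."""
    return max(path_length(p, d_vertex) for p in paths)
```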
Using the above definitions, we can define the critical path of a red-black
hypergraph H as a red-red path of maximum length, whose length is
d_max(H) = max{d(p) | p ∈ P^R(H)}.
1.2 Criticality
In this section, the concept of criticality is defined. This notion is the basis of sev-
eral circuit partitioning methods. Criticality is used to model the timing associated
with the cells in a circuit.
Figure 1.8: In this example, delays are propagated from source registers to
sink registers. Only locally maximal values are propagated along paths. This
yields a critical path length equal to 7 for the circuit.
Figure 1.9: In this example, critical path lengths are computed by back-
propagation from the sink registers to the source registers. Only maximal
criticality values are back-propagated. The criticality of each vertex is then
equal either to the length of the critical path (7) or to the length of the
second path (5).
the end of a path is assigned a total propagation time equal to the longest path
between that register and any of its predecessor vertices belonging to the path.
If we back-propagate the maximum path value from the sinks back to the source
registers, we obtain for each vertex a criticality value that is an upper bound on
the value of the longest path traversing it. Figure 1.9 shows an example of critical
path value back-propagation. This process labels cells with a value defined as the
cell’s criticality.
The criticality of a cell is equal to the upper bound of the longest path back-
propagated from the output/sink registers.
Since synchronous circuits contain no combinatorial loops, red-black
hypergraphs have this property, specifically for any directed acyclic
sub-hypergraph. As a result, it is possible to propagate the delays associated
with vertices in polynomial time, e.g., using topological sorting, in any DAH.
The back-propagation is performed in reverse topological order.
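The two propagation passes can be sketched as follows, assuming scalar vertex delays and a precomputed topological order (a simplified model for illustration, not the thesis implementation):

```python
def criticalities(succ, delay, topo_order):
    """Forward-propagate arrival times in topological order, then
    back-propagate in reverse order, labeling every vertex with its
    criticality: the length of the longest path traversing it."""
    pred = {v: [] for v in topo_order}
    for v in topo_order:
        for w in succ.get(v, []):
            pred[w].append(v)
    arrival = {}
    for v in topo_order:        # longest delay from any source up to and including v
        arrival[v] = delay[v] + max((arrival[u] for u in pred[v]), default=0)
    downstream = {}
    for v in reversed(topo_order):   # longest remaining delay strictly after v
        downstream[v] = max((delay[w] + downstream[w] for w in succ.get(v, [])),
                            default=0)
    return {v: arrival[v] + downstream[v] for v in topo_order}
```

On a circuit with two register-to-register paths, every vertex of the longest path receives the critical path length, while vertices lying only on shorter paths receive the length of their own longest traversing path.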
with WA (a) the weight of hyperarc a. If all hyperarcs have the same weight (equal
to 1), the cut size is equal to |ω(Π)|. Another metric used in some partitioning
problems to measure the quality of partitions is called connectivity-minus-one. The
connectivity-minus-one cost function fλ of some partition Π of a hypergraph H is
defined as:
fλ ≝ Σ_{a ∈ A} (λ_Π(a) − 1) .   (1.21)
Figure 1.10 shows an example of a partition and its cut and connectivity costs.
In contrast to the min-cut function fc, which counts the cut hyperedges, the
connectivity cost function counts the number of parts in which the vertices of
each hyperedge are located. This makes it possible to model the communication
cost for a task-to-process mapping problem.
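Both metrics can be computed in a single pass over the hyperedges; a minimal sketch, with illustrative dictionary encodings:

```python
def cut_metrics(hyperedges, part, weight=None):
    """Compute the cut size fc and the connectivity-minus-one cost f_lambda
    of a partition.  `hyperedges` maps each hyperedge name to its vertex set,
    `part` maps each vertex to its part index."""
    f_c, f_lambda = 0, 0
    for e, verts in hyperedges.items():
        w = 1 if weight is None else weight[e]
        lam = len({part[v] for v in verts})   # connectivity of the hyperedge
        if lam > 1:                           # hyperedge is cut
            f_c += w
        f_lambda += w * (lam - 1)
    return f_c, f_lambda
```

On the situation of Figure 1.10 (one hyperedge whose four vertices lie in four different parts), this yields fc = 1 and fλ = 3.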
Figure 1.10: In this example, each vertex is placed in a different part. The hy-
peredge e0 is therefore considered to be cut. The size of the cut, fc , is 1. The
connectivity-minus-one cost, fλ , which measures the number of shared parts for
each hyperedge, is 3, because there are 4 different parts connected by e0 .
Figure 1.11: In this example, the path is cut between vertices vi and vi+1 . The
length of this path is therefore increased by D, associated with the cut.
Consider, for example, a 4-partition in which the cut cost between parts π_0
and π_1 is 10, and the cost between π_0 and π_2 is 200. In this specific case,
avoiding spreading a critical path between parts π_0 and π_2 makes more sense.
This heterogeneity is most often due to the fact that the topology is not fully
connected, so that additional FPGAs have to be traversed for certain routes.
Figure 1.12: Example of paths that are placed on a not fully connected target
topology. Placement a) generates two routing (cut) costs, while placement b)
generates two cut costs plus two additional routing costs. Placement a) is therefore
less costly than b).
with π(u) being the part containing u and Dπ(u),π(v) the cut penalty between part
π(u) and part π(v). If u and v belong to the same part, the cut cost Dπ(u),π(v) is
equal to zero.
The problem of partitioning red-black hypergraphs, considering path lengths and
the target topology, consists in finding a partition of the vertex set V(H)
that also minimizes the objective function fp, while respecting the capacities
of each part. Let H be a red-black hypergraph and Π a partition of its vertex
set; the function fp is defined by:

fp ≝ max{d_Π(p) | p ∈ P^R(H)} .   (1.23)
In this thesis, we will focus on the function fp, which is the main objective
of our work. Since the number of connections between parts is limited by
physical constraints, the cut size does have to be taken into account. However,
the cut size metric alone cannot correctly model the effect of multiplexing on
the final delay. Signal multiplexing is a technique that allows a set of
signals to be transferred between two logic elements even when the transfer
capacity per cycle is limited. This technique induces additional delays, so as
to leave time to transfer all the signals. For example, consider a capacity c
between two parts. An extra delay cycle must be used if the number of signals
between the two parts exceeds the capacity c. Once an extra cycle is used, the
available capacity is no longer c, but at least 2c. Hence, a cut size of 2c is
as good as c + 1, whereas from the point of view of the functions fc and fλ,
it is not.
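The partitioned path length d_Π(p) and the objective fp can be sketched as follows, with a penalty matrix D standing in for the inter-part cut costs (all encodings are illustrative assumptions):

```python
def partitioned_path_length(path, d_vertex, part, D):
    """d_Pi(p): the path length plus the cut penalty D[i][j] for every arc of
    the path whose endpoints lie in different parts i and j."""
    total = sum(d_vertex[v] for v in path)
    for u, v in zip(path, path[1:]):
        if part[u] != part[v]:
            total += D[part[u]][part[v]]
    return total

def f_p(paths, d_vertex, part, D):
    """Objective f_p: length of the longest red-red path after partitioning."""
    return max(partitioned_path_length(p, d_vertex, part, D) for p in paths)
```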
In the case of the directed graph model [8], only edges between the source of
the hyperedge and its sinks are created. The directed graph model allows one to
represent the relationships between the components of a circuit, as well as the
combinatorial paths associated with the circuit. For a directed hypergraph
H = (V, A), the graph associated with H using the directed graph model is the
graph G = (V, E) such that E = {{u, v} | ∃a ∈ A, u ∈ s⁻(a), v ∈ s⁺(a)}.
Figure 1.13 shows an example hypergraph and the corresponding clique and
directed graphs.
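A direct transcription of this construction, assuming hyperarcs are given as (sources, sinks) pairs (an illustrative encoding):

```python
def directed_graph_model(hyperarcs):
    """Build the directed graph model of a directed hypergraph: one arc
    (u, v) for every source u and sink v of every hyperarc.  Each hyperarc
    is given as a (sources, sinks) pair of vertex sets."""
    edges = set()
    for sources, sinks in hyperarcs:
        for u in sources:
            for v in sinks:
                edges.add((u, v))
    return edges
```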
Figure 1.14: An example of the limitations of the graph-based model for parti-
tioning a circuit to minimize cut size. In this example, (a) the circuit is composed
of 5 cells connected by 3 wires; (b) in this specific case, an optimal graph-based
partition would place cell v5 in one part and all the other cells in the other, as
this is the only two-part partition that results in only two cut edges; (c) however,
by placing cells v2 and v3 of the circuit in one part and cells v1 , v4 , and v5 in the
other, we obtain a partition in two parts, which requires only a single wire crossing
the cut. The example is taken from [176].
1.4 Conclusion
In this chapter, the notations and definitions used in this thesis have been
introduced. Table 1.1 summarizes the most important notations defined above,
which will be used throughout the document; not all notations are listed, only
the most important ones.
We have introduced the new concept of red-black hypergraphs, which are an
extension of hypergraphs. This enriched model is necessary to consider all the
relevant properties associated with digital electronic circuits, to provide metrics
and data that are closer to reality. The ability to calculate the length of a path and
its degradation during a partitioning operation is an essential issue for minimizing
the degradation of the critical path. We have shown that previous state-of-the-art
circuit representations cannot be used to model such paths adequately.
Variable Definition
G = (V, E) Graph
H = (V, A) Red-black directed hypergraph
VR Set of red vertices
VB Set of black vertices
v Some vertex v ∈ V
a Some hyperarc a ⊂ V
Γ(v) Vertex set of neighbors of v
∆(H) The maximum vertex degree in H
Π Partition, with Π[v] the part of v
πi Part i, with πi [v] = 1 iff v ∈ πi and 0 else
Mi Capacity of part i
p Some path p ∈ P
dmax (u, v) Maximum distance between u and v
dmax (H) Longest path distance for H
d^Π_max(H) Longest path distance for H partitioned according to Π
Dij The path cost between parts i and j
ω(Π) Cut of a partition Π
h (Π) Halo of a partition Π
λa Connectivity of hyperarc a ∈ A
Λ(H) The maximum connectivity in H
λv Connectivity of vertex v ∈ V
ΛV (H) The maximum connectivity of vertices in H
fλ Connectivity-minus-one cost function
fc Cut cost function
fp Longest-path cost function
Chapter 2
State of the art in circuit and hypergraph partitioning
2.1 Partitioning methods and applications
In k-way partitioning, the hypergraph is directly partitioned into k parts
without going through the 2-way recursive approach. Note that there are
strategies that combine both recursive bisection (RB) and k-way partitioning.
Static graph mapping is the problem of finding a mapping between two graphs:
the source graph Gs and the target graph Gt. The number of vertices in Gt
defines the number of parts, each part possibly having a bounded capacity
depending on the corresponding vertex of Gt. A practical application of the
static mapping problem is the mapping of the processes of a parallel program
onto a parallel machine. One objective of the static mapping problem is to
minimize the global communication bandwidth. M. R. Garey et al. [78] have shown
that static mapping is an NP-complete problem in the general case.
F. Pellegrini [152, 148] proposed an algorithm based on the recursive
bipartition of the graphs Gs and Gt. All his algorithms are implemented in the
SCOTCH [149] tool. The reader can refer to several papers dealing with static
graph mapping [24, 41, 5, 97, 150, 116, 54, 22].
Mapping is an important notion for our study, since the target topology can be
a factor affecting the critical path. Modeling static mapping from the source
graph onto a target graph helps the algorithm assign vertices to parts in a way
that yields lower routing costs. In this thesis, we do not work on detailed
placement/routing on the FPGA, but on the partitioning step that precedes it.
However, a coarse-grained, platform-scale placement may allow a better
pre-placement of vertices into parts, compared to algorithms that are agnostic
to the target topology.
A circuit can be modeled as a graph or hypergraph, so graph and hypergraph
partitioning is commonly used to obtain a partition of a circuit. Hence, the
complexity of circuit partitioning is close to that of graph and hypergraph
partitioning. There exist several ways to measure the quality of a partition.
Multiple cost functions have been defined in the literature, such as fc or fλ.
Additional constraints can complement cost functions for the partitioning
problem, such as capacity limits for each part. During a partitioning process,
both the cost function and the constraints have to be evaluated, so the
complexity of the process depends on the number of constraints and on the
complexity of the cost function. This section reviews different complexity
results for the graph and hypergraph partitioning problem.
A partition of a graph creates a cocycle, i.e., the set of edges (resp.,
hyperedges) spanning multiple parts. The cocycle of a partition defines a cut,
and its size defines the cut size. Finding a minimal cut in a graph G is
equivalent to finding a minimal (s-t)-cut in G. An (s-t)-cut in a graph G is a
minimum edge separator of a source vertex s and a terminal vertex t in G. Note
that an (s-t)-cut does not guarantee a balanced number of vertices on both
sides of the cut. Using Ford and Fulkerson's max-flow/min-cut theorem [77],
this problem can be solved in polynomial time.
Some algorithms with an improved time complexity have been proposed to find
a minimum cut in a graph and multi-graphs [138, 90]. E. L. Lawler [121] and
J. Li et al. [126] proposed to compute an (s-t)-flow for hypergraphs to compute
the minimum cut. In the case of hypergraph partitioning, the hypergraph can be
transformed into a polynomially larger graph. When k > 2, the problem becomes
NP-hard if k is part of the input [83]. However, if k is fixed, there is an
algorithm of complexity O(n^{k^2} T(n, m)) [83], where T(n, m) is
2.3.6 Conclusion
This section presents the various approaches that have been developed to partition
circuits for different practical purposes, such as circuit placement on a 2D surface
with critical path minimization, or circuit prototyping on a multi-FPGA platform.
Some approaches use the tools presented in the previous section in conjunction with
pre- and post-processing to drive the tool with practical constraints, such as cell
spacing for 2D placement or wire length. Other approaches use exact optimization
algorithms or metaheuristics.
Integer Programming
Linear programming was developed in the 1940s and 1950s by researchers such as
G. Dantzig [52] and L. Kantorovich et al. [53]. It has become a powerful tool for
solving optimization problems where a linear function needs to be maximized or
minimized under linear constraints.
However, in many cases, the optimal solutions to linear programming problems
are fractional numbers [84]. This was often impractical or unacceptable in domains
where variables had to represent discrete or integer values, such as resource allo-
cation, scheduling, and network design. Thus, integer linear programming became
Clustering approaches
Clustering is a form of partitioning where the number of parts is not fixed; only
the capacity for each part is limited. However, it is possible to translate from a
partitioning formulation to a clustering formulation. The only difference between
the two formulations concerns the balancing factor. In the case of partitioning,
the number of parts is associated with a maximum capacity per part, assuming
that each part is filled as evenly as possible, according to a balancing factor. In
clustering, however, only the maximum capacity of the parts is limited, but the
number of parts is not. It is therefore possible for a cluster to consist of a single
element. As electronic circuits have grown in size, they have become increasingly
challenging to manage in practice. As a result, reducing the apparent size of
instances has become a major goal for VLSI design. Clustering algorithms are one
of the approaches that have been studied to reduce the apparent size of circuits to
make them more practical for automatic processing.
E. L. Lawler et al. [122] have presented a polynomial algorithm for grouping
the vertices of a circuit so that the resulting delay is optimal. The authors
use vertex replication to reduce the number of cut paths in the circuit. A
trivial way is to replicate the circuit in each part, if capacity permits, thus
avoiding cutting the critical path. Vertex replication makes the problem
solvable in polynomial time. C. Sze et al. [185] presented an extension of the
clustering algorithm to a more general delay model. Other works have proposed
clustering approaches that also take advantage of vertex replication [157, 190].
More recently, Z. Donovan et al. [60, 61, 62] have studied the combinatorial
circuit clustering problem with and without vertex replication, and propose
several algorithms to solve it. The authors present NP-hardness proofs for the
DAG circuit clustering problem with minimization of critical path degradation
during the clustering step, i.e., minimization of the number of clusters along
the most critical paths. They propose exact exponential algorithms and
approximation algorithms parameterized by cluster size. Further details of this
work can be found in Z. Donovan's thesis [59].
Other work on combinatorial circuit clustering is available in these papers [145,
47]. More details about clustering methods for circuit partitioning can be found
in Chapter 4.
Multilevel scheme
Figure 2.2: The multilevel scheme which consists of three phases: coarsening,
initial partitioning, and refinement.
The multilevel scheme has proven to be a very efficient methodology for par-
titioning graphs and hypergraphs. The multilevel scheme was first introduced by
S. T. Barnard and H. D. Simon [19] in 1994 for the graph bisection problem.
Independently, T. N. Bui et al. [29] used a multilevel heuristic for sparse matrix
factorization. B. Hendrickson et al. [88] developed a multilevel algorithm for graph
partitioning, and S. Hauck et al. [87] for partitioning logic circuits for VLSI
design. J. Cong et al. [46] have also proposed a multilevel method for VLSI
design in which circuits are represented by graphs and a clique-oriented
clustering algorithm is designed for the coarsening stage. The clique-oriented
clustering algorithm favors the clustering of vertices that form cliques or
semi-cliques in the graph. An illustration of the multilevel scheme can be
found in Figure 2.2.
The multilevel framework consists of three phases: coarsening, initial partition-
ing and refinement. The coarsening phase recursively uses a clustering method to
transform the considered hypergraph into a smaller one. The aim of a good
clustering algorithm is to preserve the same global structure at each
clustering level, but this is hard to achieve in practice. During the second
phase, an initial
partitioning is computed on the smallest, or coarsest, hypergraph. Then, in the
third phase, an algorithm is applied at each recursion level to prolong the com-
puted partition to the upper level, and subsequently refine it. Let us recall that
the most common algorithms used for the refinement phase are the Kernighan-Lin
(KL) [114] and Fiduccia–Mattheyses (FM) [75] algorithms. As described in the
previous chapter, these two heuristics are based on local search, moving
vertices across parts so as to reduce the cut of balanced hypergraph
bipartitions. KL selects, from each part of a given bipartition, a pair of
vertices whose swap maximizes the gain, where the swap gain is the reduction in
the number of cut edges. FM computes a move gain for each vertex and performs a
single move at each iteration. More details about the algorithms integrated in
multilevel schemes can be found in a recent survey on hypergraph
partitioning [35].
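The FM move gain just described can be sketched for a bipartition as follows; this is a simplified, unweighted version for illustration (practical implementations maintain gain buckets incrementally rather than recomputing):

```python
def fm_move_gain(v, incident, hyperedges, part):
    """FM gain of moving vertex v to the other side of a bipartition:
    hyperedges that become uncut minus hyperedges that become newly cut.
    `incident[v]` lists the hyperedges containing v; `hyperedges` maps each
    hyperedge to its vertex set; `part` maps each vertex to 0 or 1."""
    gain = 0
    for e in incident[v]:
        others = [part[u] for u in hyperedges[e] if u != v]
        if others and all(p != part[v] for p in others):
            gain += 1   # v is the only vertex of e on its side: e becomes uncut
        elif others and all(p == part[v] for p in others):
            gain -= 1   # e was entirely on v's side: moving v cuts it
    return gain
```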
The netlists produced by synthesis tools can be organized hierarchically, with
the overall circuit being organized as a set of functional blocks, which in turn are
recursively made up of sub-blocks. The lowest blocks in the hierarchy consist only
of basic blocks. Another strategy is to use the circuit hierarchy to define the dif-
ferent levels. Hierarchical partitioning starts with the specification of the overall
circuit, which is divided into smaller sub-circuits that can be individually designed
and optimized. The efficiency of the decomposition methods depends on the ap-
plications. H. Krupnova et al. [118] presented a hierarchy-driven circuit
partitioning method for large ASIC prototyping. D. Behrens et al. [20] used the
circuit hierarchy as the clustering criterion. As the circuit hierarchy is part of the
input data, the computation time of the partitioning flow is reduced compared
to k-way partitioning tools. Y. Cheon et al. [39] introduced a multilevel
partitioning algorithm guided by the circuit hierarchy, which obtains better
results than hMetis. More recently, U. Farooq et al. [73] presented
hierarchy-based circuit
partitioning for a multi-FPGA platform. Their experiments compare mono and
multi-cluster CPU circuits generated with the open source tool DSX [154]. Mono-
cluster benchmarks are mainly characterized by non-hierarchical interconnection.
Multi-cluster benchmarks are hierarchical in nature, with different clusters inter-
connected in a hierarchy. The results show that the multilevel approach
produces better results than the hierarchical approach for the mono-cluster
benchmarks, whereas the hierarchical method performs better for the
multi-cluster benchmarks.
Other works also use hierarchical circuit properties to partition circuits such
as those of W. J. Fang et al. [72] and, more recently, U. Farooq et al. [74].
the timing associated with the circuit. The Elmore delay for an arc a, from a net
source to one of its sinks, is defined as:
Delay(a) = Re × Ce/2 + Ct ,   (2.1)

where Re is the wire lumped resistance, Ce the wire lumped capacitance, and Ct
the total lumped capacitance of the source node of each net. From this timing,
they evaluate a subset of paths that are the most critical of the circuit. The
length of these critical paths is used as a hyperedge weight, to prevent the
corresponding hyperedges from being cut. The set of most critical paths is
regularly re-evaluated, because it may change as cut penalties are added.
S. H. Liou et al. [127] presented a partitioning flow with pre-processing and
post-processing around the hMetis algorithm, to account for the performance
degradation associated with the partitioning step. The pre-processing consists
in placing the circuit on a 2D surface. The distance between vertices is then
used as a weight for each hyperedge during the partitioning stage with hMetis.
The aim is to avoid cutting a pre-evaluated set of paths that are considered
critical. Finally, a post-processing step optimizes the assignment of the parts
on the multi-FPGA platform. The authors also perform an optimization pass for
signal multiplexing. U. Farooq et al. [73] compared two circuit partitioning
methods for multi-FPGA platform prototyping; one is based on a multilevel
approach, and the other on the circuit hierarchy.
M. H. Chen et al. [38] proposed a partitioning algorithm, similar to hMetis,
which considers a metric associated with vertex distance in the hypergraph.
This metric can be approximated by the eccentricity, i.e., the greatest
distance between a node and the other nodes in the hypergraph. Eccentricity is
used for the coarsening step. The authors also proposed a routing indicator,
computed by finding a route between vertices using an A* algorithm. From these
indicators, the authors derive a partitioning flow that tries to minimize the
impact of partitioning on the circuit's performance.
D. Zheng et al. [193] presented TopoPart, a topology-driven hypergraph
partitioner designed for multi-FPGA platforms. TopoPart first applies an
algorithm that finds candidate FPGAs for each vertex while respecting topology
and fixed-vertex constraints. The coarsening step is then performed by merging
vertices that not only have high connectivity but also share a high number of
FPGA candidates. Next, an initial partitioning is performed on the coarsest
hypergraph to obtain a feasible initial solution. The initial partitioning
first assigns the fixed vertices, if any; it then assigns the neighbors of
fixed vertices and the vertices with the lowest number of FPGA candidates.
Finally, uncoarsening and refinement steps are applied to minimize the cut size
while maintaining topology and resource constraints.
Simulated annealing
Some parameters are required to simulate the physical process and address the
problem to be optimized. These parameters are listed below:
• T, the temperature;
• S, the solution;
• ∆E, the energy variation.
Simulated annealing has been applied to the problem of partitioning and circuit
placement [132, 79, 131]. Simulated annealing is often used for placement and
routing once the circuit has been partitioned. For
example, in their paper, P. Maidee et al. [130] presented a partitioning-based algo-
rithm for placement on multi-FPGAs platform. This algorithm first partitions the
circuit using hMetis [107], and then uses the VPR [23] simulated annealing-based
tool to refine the placement.
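Using the parameters listed above, the core of simulated annealing is its acceptance rule, the Metropolis criterion, which can be sketched as follows (a generic illustration, not a description of any particular tool):

```python
import math
import random

def accept(delta_e, T, rng=random.random):
    """Metropolis criterion: always accept an improving move (delta_e <= 0);
    accept a degrading move with probability exp(-delta_e / T), so that high
    temperatures T tolerate more degradation than low ones."""
    return delta_e <= 0 or rng() < math.exp(-delta_e / T)
```

As T decreases over the iterations, degrading moves become ever less likely, and the search settles into a low-energy solution.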
Tabu Search
Formalized in 1986 by F. Glover [81], tabu search is a local search algorithm
that maintains a FIFO queue of already-explored solutions Qs, called the tabu
list. Starting from a solution S, the algorithm explores the neighborhood of S
while excluding the solutions in the queue Qs. This prevents the search from
returning too quickly to a solution that has already been explored. For
minimization, if there is no better solution in the neighborhood of S, the
search continues by exploring higher-cost solutions, which makes it possible to
escape from a local minimum.
The tabu list size must be adapted to the objective, the nature of the problem,
and the expected performance. The size of the queue has a significant impact on
the computation time of the algorithm. However, it is possible to re-explore
solutions in the tabu list using an "aspiration" value. This value is an
acceptance metric based on the cost or properties of the solution. Aspiration
can be used, for example, to encourage the exploration of desired types of
solutions.
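A minimal tabu search skeleton following this description (the neighborhood and cost functions are left abstract; storing whole solutions in the tabu list is a simplification, as practical implementations often store moves instead, and no aspiration criterion is included):

```python
from collections import deque

def tabu_search(start, neighbors, cost, tabu_size, iterations):
    """Minimal tabu search: at each step, move to the best non-tabu neighbor,
    even if it degrades the cost, and remember visited solutions in a
    fixed-size FIFO tabu list."""
    current, best = start, start
    tabu = deque([start], maxlen=tabu_size)
    for _ in range(iterations):
        candidates = [s for s in neighbors(current) if s not in tabu]
        if not candidates:
            break
        current = min(candidates, key=cost)   # may be worse than `current`
        tabu.append(current)
        if cost(current) < cost(best):
            best = current
    return best
```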
S. Areibi et al. [13, 14] applied the tabu search strategy to a simulated
annealing algorithm. J. M. Emmert et al. [69] presented a two-step approach,
partitioning followed by placement, both using a tabu search algorithm. For the
partitioning step, the algorithm minimizes the cut size. Then, for the
placement step, the algorithm minimizes the total Manhattan distance by placing
the circuit on a grid. Other works on tabu search addressed the problem of
partitioning circuits with or without considering the target topology
[128, 68, 166].
Ant colonies
M. Dorigo et al. first introduced ant colony optimization (ACO) algorithms
[63, 42] in 1991. An ant colony algorithm is inspired by the behavior of real
ants as they forage for food sources and communicate with each other by
depositing pheromones.
The first step consists in building the solutions: the ants move along a graph
representing the problem to be solved. The ants build solutions sequentially,
moving from one node to another according to specific rules.
An important aspect of ACO is the pheromone deposition strategy. Once a
solution has been constructed, the ants deposit pheromones on the edges they
traversed. Pheromone levels are adjusted according to the quality of the
solution, usually measured by a dedicated objective function and constraints.
Pheromone levels on edges are updated according to specific rules. For example,
edges taken by ants that have built good solutions can increase their pheromone
levels, while edges taken by ants that have built bad solutions can decrease their
pheromone levels. The steps of building solutions, depositing pheromones, and
updating are repeated over several iterations. As the iterations progress, the
pheromone levels guide the ants to favor the best-quality solutions. The algo-
rithm gradually converges to one or more local minimum solutions.
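The evaporation-and-deposit update described above can be sketched as follows (the evaporation rate rho, the deposit constant Q, and the edge-keyed pheromone table are illustrative choices, not parameters from any cited work):

```python
def update_pheromones(tau, solutions, rho=0.5, Q=1.0):
    """One ACO pheromone update: evaporate every trail by factor (1 - rho),
    then deposit Q / cost on each edge used by each constructed solution, so
    that edges of better (cheaper) solutions receive more pheromone."""
    for edge in tau:
        tau[edge] *= (1.0 - rho)
    for edges, cost in solutions:
        for edge in edges:
            tau[edge] += Q / cost
    return tau
```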
ACO is a heuristic method for combinatorial optimization problems, such as
the Traveling Salesman Problem (TSP) studied in the work of M. Dorigo et al. [64].
An ACO approach for graph bipartitioning can be found in a work done by M.
Leng et al. [124]. A k-way graph partitioning approach using ACO is studied in
work by K. Tashkova Korošec et al. [187]. There exist studies for netlist and hyper-
graph partitioning. For instance, the work by P. Danassis et al. [51] introduced a
novel netlist partitioning and placement algorithm named ANT3D, targeting 3-D
reconfigurable architectures based on ACO. More recently, R. Guru Pavithra et
al. [147] presented an ACO-based partition model for VLSI physical design.
Applying an ant colony algorithm to partition digital electronic circuits for
VLSI design permits problem-specific adaptations and custom parameters to ac-
count for domain constraints and objectives compared to min-cut tools. Comple-
mentary techniques can also be used with the ant colony algorithm to improve
partitioning performance and results [51, 147].
Evolutionary Algorithms
One subset of evolutionary algorithms is genetic algorithms (GA) [100, 99],
originally developed in the 1960s by John Holland and his colleagues at the
University of Michigan. The principle of the algorithm is an iterative process
that maintains a population of solutions. Each solution is represented by a
string of digits, or chromosome; the genes of a chromosome correspond to the
digits of the string. From these digits, each solution is assigned a cost
according to the cost function to optimize.
The heart of the algorithm is to produce multiple generations of populations,
i.e., sets of solutions. Each generation corresponds to one iteration of the al-
gorithm. During each generation, the solutions in the current population are
evaluated by a cost function dedicated to the optimization problem. Based on
these evaluations, a new population of candidate solutions is formed using specific
genetic operators: crossover and mutation. Crossover consists of combining the genetic
information (binary strings) of two parent strings to generate new offspring.
Mutation flips an arbitrary bit of a chromosome from its original state; it
introduces diversity into the sampled population and is used in an attempt to
avoid local minima.
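The crossover and mutation operators described above can be sketched as follows. This is an illustrative Python sketch, not the implementation of any cited work; the bit-string encoding, mutation rate, and truncation selection used here are assumptions.

```python
import random

def crossover(parent1: str, parent2: str) -> str:
    """Single-point crossover: combine the genetic information of two
    parent bit strings to generate one offspring."""
    point = random.randint(1, len(parent1) - 1)
    return parent1[:point] + parent2[point:]

def mutate(chromosome: str, rate: float = 0.01) -> str:
    """Flip each bit with probability `rate`, reintroducing diversity
    to help the search escape local minima."""
    return "".join(
        ("1" if bit == "0" else "0") if random.random() < rate else bit
        for bit in chromosome
    )

def evolve(population, cost, generations=100):
    """One possible generation loop: evaluate all solutions, keep the
    fittest half as parents, and refill the population with mutated
    offspring of randomly paired parents."""
    for _ in range(generations):
        population.sort(key=cost)
        parents = population[: len(population) // 2]
        children = [
            mutate(crossover(*random.sample(parents, 2)))
            for _ in range(len(population) - len(parents))
        ]
        population = parents + children
    return min(population, key=cost)
```

Because the parents are carried over unchanged, the best cost never worsens from one generation to the next (an elitist selection choice made for this sketch).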
S. M. Sait et al. [165] addressed the problem of optimizing delay, power and
cutset in the partitioning step at the physical level. In their work, the authors
presented three iterative approaches based respectively on a genetic algorithm, a
tabu search and a simulated evolution to solve the multi-objective optimization
problem of partitioning. S. S. Gill et al. [80] addressed the k-way circuit partition-
ing using genetic algorithms. J. I. Hidalgo et al. [96] proposed a genetic algorithm
for partitioning and circuit placement for multi-FPGA platforms. Their work
models the circuit as a graph and targets a mesh topology of size 4, made up of 4
FPGAs. The algorithm minimizes the number of inputs and outputs connecting
each FPGA, while preserving the circuit structure, i.e., connections and cells.
Another type of evolutionary algorithm is the memetic algorithm. Memetic
algorithms combine a genetic algorithm with a local search algorithm to improve
convergence. For example, S. Areibi's papers [12, 15] presented a genetic algorithm
coupled with two local search methods. The first local search method extends the
Fiduccia-Mattheyses algorithm [75] to k-way partitioning. The second method
is an extension of the Sanchis KFM implementation [167] that applies moves even
when no improving move exists. This modification avoids getting stuck in
a local minimum, improving the convergence of the algorithm. Other works on
the circuit partitioning problem, based on evolutionary algorithms, can be found
in [17, 170, 18, 117, 181].
2.5 Conclusion
This chapter overviewed the current state of the art in circuit partitioning with
path length minimization. The complexity section introduced the fact that the
partitioning process is NP-hard. This means that effort must be made to develop
efficient heuristics to compute a good partition in an acceptable computation time.
Some publicly available tools have been developed, which are presented in Sec-
tion 2.3. However, these tools are dedicated to the problem of cut minimization.
This problem is relevant to us, but it is not our main objective. Relevant works are
presented in Section 2.4.1. These works rely on min-cut tools for circuit partitioning,
that is, their authors devised various kinds of pre-processing to model path
costs and drive the min-cut tool. Moreover, other works, presented in Section 2.4.2,
are based on meta-heuristic algorithms. These algorithms require good
parameter tuning and an extra effort to find a good embedding of the problem. Hence,
using min-cut tools or meta-heuristics demands an extra effort that
is not necessary with a dedicated algorithm. As Abraham Maslow said:
52 J. Rodriguez
2. State of the art in circuit and hypergraph partitioning
“If the only tool you have is a hammer, it is tempting to treat everything
as if it were a nail.”
Chapter 3
Experimental setup and methodology
This chapter presents the methodology used to evaluate the path cost function
fp of partitions of the red-black hypergraphs.
Measuring fp means computing the critical path in some red-black hyper-
graph. The problem of computing the longest path in a hypergraph is generally
intractable, due to its NP-hardness. However, because of the properties of red-
black hypergraphs, such as the acyclicity within the DAHs that compose them,
one can compute the cost function fp in polynomial time. The algorithms needed
to compute fp are presented in Section 3.1.
In Section 3.2, we will present the digital circuits we used to compare our
proposed partitioning strategies with solutions based on min-cut partitioning tools.
We will use two sets of publicly available benchmarks and two sets created by us.
These third and fourth sets of benchmarks have been designed to contain topologies
of circuit with characteristics that differ significantly from those of the first two
sets of benchmarks.
In this thesis, we are interested not only in preventing degradation of the critical
path during partitioning, but also in the cost associated with routing signals across
parts. Therefore, it is necessary to evaluate partitioning strategies on different
multi-FPGA platform topologies. To this end, in Section 3.3, we will use six
platforms, three of which consist of four parts, and the other three of eight parts.
3.1 Critical path in red-black hypergraphs
In this section, we show that the longest path problem for red-black hypergraphs can be solved in polynomial time. Hence,
we will define an algorithm to compute the cost function fp of a partition of a
red-black hypergraph. This algorithm will be used to evaluate all the partitioning
results presented in this dissertation.
3.1.1 Hypergraphs
Computing the critical path of a hypergraph amounts to computing its longest
path. A longest path between two vertices u and v is a simple path of maximum
length between u and v. Formally, we defined dmax(u, v) as the length of the
longest path between u and v. Given a hypergraph H = (V, E), the longest path
problem consists in finding a path p in H such that:

d(p) = max_{p′∈P(H)} d(p′) .

The longest path problem is known to be NP-hard in the general case [78].
Consequently, in the general case, fp cannot be computed in polynomial time,
whatever the cost model operating on the hypergraph or graph representation.
However, in the next subsection, we will show that, for red-black hypergraphs,
the longest path problem can be solved in polynomial time.
58 J. Rodriguez
3. Experimental setup and methodology
A path p is said to be quasi-critical if its delay is close to that of the critical path, as follows:

d(p) > d(pmax) − D , (3.2)

with pmax the critical path and D a delay tolerance.
A hypergraph with few critical paths or no quasi-critical paths is unlikely to
be degraded too much if the few existing critical paths are preserved from being
cut.
Algorithm 2 is an extension of Kahn’s algorithm (topological sort) for red-
black hypergraphs. It provides the vertex sorting needed for the critical path
computation implemented in Algorithm 1.
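On a plain DAG, the combination of a Kahn-style topological sort and the subsequent longest-path computation can be sketched as follows. This is a simplified illustration on ordinary directed graphs, not the thesis's red-black hypergraph implementation of Algorithms 1 and 2; the `succ`/`delay` dictionary representation is an assumption.

```python
from collections import deque

def critical_path_cost(succ, delay):
    """Cost of the critical (longest) path in a DAG.
    `succ` maps every vertex (including sinks) to its successor list;
    `delay` maps every vertex to its traversal delay d(v).
    Kahn's algorithm yields a topological order, and a dynamic-programming
    pass accumulates vertex delays along that order."""
    indeg = {v: 0 for v in succ}
    for v in succ:
        for w in succ[v]:
            indeg[w] += 1
    queue = deque(v for v in succ if indeg[v] == 0)
    # l[v]: cost of the longest path from any source ending at v
    l = {v: delay[v] for v in succ}
    visited = 0
    while queue:
        v = queue.popleft()
        visited += 1
        for w in succ[v]:
            l[w] = max(l[w], l[v] + delay[w])
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    assert visited == len(succ), "graph must be acyclic"
    return max(l.values())
```

The acyclicity assertion mirrors the property that makes the computation tractable here: each combinatorial block is acyclic, so the topological order always exists.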
3.2 Benchmarks
The hypergraphs of our benchmarks are taken from the ITC99 benchmark [50]
presented in Subsection 3.2.1, the Titan23 benchmark [137] presented in Subsec-
tion 3.2.2, and the Chipyard benchmark [9] presented in Subsection 3.2.3. To rep-
resent different circuit topologies, we selected a representative subset of instances
of each benchmark.
For each instance, we use topology data to define a traversal cost d(v) for each
vertex v, corresponding to the traversal time of a logic element. In order to get
realistic results, we set the cut cost to be at least one order of magnitude higher
than the propagation delay of a combinatorial cell.
3.2.1 ITC99
The ITC99 digital circuits are designed to evaluate the effectiveness of circuit
testing methods such as Automatic Test Pattern Generation (ATPG) and Design
for Testing (DFT). ATPG is a method of automating electronic design for finding
an input sequence that distinguishes a correct circuit from a faulty one. DFT
consists of integrated circuit design techniques that add testability features to the
design of a digital circuit product. The added features facilitate manufacturing
testing of the designed digital circuit. More details about this benchmark are
available in [50].
Table 3.1: Characteristics of the ITC99 benchmark instances, with |V | the number of vertices, |V R | the number of
red vertices, |A| the number of hyperarcs, ∆ the maximum degree of vertices, δ the average degree of vertices ±
its standard deviation, Λ the maximum connectivity of hyperedges, and λ the average connectivity of hyperedges ±
its standard deviation. As we can see, the proportion of red vertices varies from one circuit to another;
see, e.g., b02 (30/8 = 3.75) and b19 (233685/9105 = 25.66).
Table 3.2: Paths statistics for the ITC99 benchmark, with pmax the critical path,
≈#pmax a lower bound on the number of critical paths, d̃(p) the median maximum
delay of paths traversing vertices, max length(p) the longest red-red path,
i.e., the maximum number of vertices within red-red paths, and ≈length(p) the
approximated average length of red-red paths ± its standard deviation.

Instance pmax ≈#pmax d̃(p) max length(p) ≈length(p) ± std. dev.
b01 3.86 1 3.29 8 3.78 ± 1.9
b02 3.28 2 2.71 7 3.33 ± 1.89
b03 6.18 12 2.71 12 6.07 ± 3.73
b04 16.62 1 9.09 30 9.48 ± 6.91
b05 31.7 2 16.05 56 17.84 ± 14.29
b06 3.28 3 2.71 7 3.57 ± 1.86
b07 18.36 3 8.51 33 12.17 ± 8.01
b08 9.66 4 5.61 18 6.86 ± 4.38
b09 5.6 8 3.87 11 5.77 ± 2.89
b10 7.34 3 5.03 14 5.9 ± 3.49
b11 20.1 1 8.51 36 9.46 ± 7.17
b12 11.4 2 4.45 21 7.5 ± 3.93
b13 11.98 1 3.29 22 6.09 ± 4.32
b14 35.18 2 23.59 62 25.94 ± 10.96
b17 53.74 1 23.59 94 27.72 ± 20.63
b18 95.5 16 22.43 166 28.11 ± 24.91
b19 97.82 32 22.43 170 29.11 ± 26.04
b20 39.24 2 26.49 69 29.88 ± 14.68
b21 39.82 1 27.07 70 29.44 ± 14.9
b22 39.82 1 25.91 70 28.83 ± 14.47
3.2.2 Titan
The Titan benchmark consists of 23 digital circuits taken from a variety of real-
world applications. They reflect modern designs of large-scale systems and use
heterogeneous resources. The Titan benchmark was created to compare two CAD
tools, VPR [23] and Quartus II from Altera (now Intel). More details about the
Titan benchmarks can be found in [137].
In our research team, at CEA LIST, the tool used for logic synthesis is Vivado
from Xilinx (now AMD). Logic synthesis is a process that translates the behavioral
specification of a circuit, written in a hardware description language (typically,
Verilog or VHDL) into a netlist that can be implemented on an FPGA or an
ASIC. The netlist instantiates logic elements that are available on the desired
target technology; in this case, our FPGA model. It is therefore technology-
dependent. A netlist synthesized for an Altera platform cannot be implemented
on a Xilinx FPGA. Furthermore, there are several non-compatible generations of
technologies available from the same vendor. As a result, in this dissertation, we
chose to use the Virtex-7 FPGA technology from Xilinx to transform the abstract
circuits of the Titan benchmark into synthesized circuits. To define the vertex
delay, we use an approach similar to that of [127]. In this paper, S. Liou et al.
evaluated the topological depth of paths, i.e., they assigned a unit delay to each
vertex. In this dissertation, we assign a delay of 0.58 to each black vertex, which
corresponds to LUT1 traversal time for the Xilinx Virtex-7 speed grade 3. As
said previously, delays can be inferred from the targeted technology. However, in
this dissertation, we restricted our benchmark to six target topologies described
in Section 3.3.
Table 3.3: Applications of the Titan benchmark instances.
Instance Application
bitonic_mesh Sorting
cholesky_bdti Matrix Decomposition
dart On Chip Network Simulator
cholesky_mc Matrix Decomposition
des90 Multi µP system
xge_mac 10GE MAC Core
denoise Image Processing
1. A Lookup Table (LUT) is a base element of an FPGA that contains a programmable truth
table, and is used to implement combinatorial logic. LUTs can be seen as programmable (sets
of) logic gates.
Table 3.4: Characteristics of the Titan benchmark instances, with |V | the number of vertices, |V R | the number of
red vertices, |A| the number of hyperarcs, ∆ the maximum degree of vertices, δ the average degree of vertices ± its
standard deviation, Λ the maximum connectivity of hyperedges, and λ the average connectivity of hyperedges ± its
standard deviation.
Table 3.5: Paths statistics for the Titan benchmark, with pmax the critical path,
≈#pmax a lower bound on the number of critical paths, d̃(p) the median maximum
delay of paths traversing vertices, max length(p) the longest red-red path,
i.e., the maximum number of vertices within red-red paths, and ≈length(p) the
approximated average length of red-red paths ± its standard deviation.

Instance pmax ≈#pmax d̃(p) max length(p) ≈length(p) ± std. dev.
bitonic_mesh 13.14 32 2.71 24 6.85 ± 6.44
cholesky_bdti 10.82 6 0.39 20 3.72 ± 2.84
dart 35.76 8 1.55 63 7.5 ± 6.42
cholesky_mc 10.82 18 0.39 20 3.71 ± 2.92
des90 13.14 544 2.71 24 6.95 ± 6.67
xge_mac 6.76 4 0.97 13 3.61 ± 2.17
denoise 2304.14 8 20.69 3974 540.71 ± 1047.49
These instances have a low proportion of their red vertices ending a critical
path. The circuit with the highest number of red vertices in which a critical path
ends is des90. The Titan instances, like the big ITC instances, exhibit a significant
discrepancy between the number of vertices in the longest path and the average
path size. This indicates that there is a low proportion of critical and quasi-critical
paths in these circuits. Note that circuit denoise has the highest critical path cost
and path length among all circuits.
tool from CEA LIST [36], which allows exporting a VHDL description of a data-flow
hardware circuit implementing a chosen neural network description.
Table 3.6 presents some of the characteristics of the subset of our Chipyard
benchmarks used to evaluate the algorithms proposed in this thesis as well as
third-party partitioning tools. This benchmark is composed of several basic circuits
whose sizes are comparable to those of the circuits selected from the ITC and Titan
benchmarks. Note that WasgaServer is the biggest circuit of all benchmarks, with
1622238 vertices, 403886 red vertices, and 1675291 hyperarcs. Its average degree
δ is higher than that of the ITC benchmark circuits and similar to that of Titan.
Like the digital circuits in the previous benchmark sets, these circuits are sparse.
Table 3.7: Paths statistics for the Chipyard benchmark instances, with pmax the
critical path, ≈#pmax a lower bound on the number of critical paths, d̃(p) the
median maximum delay of paths traversing vertices, max length(p) the longest
red-red path, i.e., the maximum number of vertices within red-red paths, and
≈length(p) the approximated average length of red-red paths ± its standard
deviation.

Instance pmax ≈#pmax d̃(p) max length(p) ≈length(p) ± std. dev.
EightCore 47.94 48 5.03 84 12.08 ± 12.56
mnist 6.18 524 2.71 12 4.09 ± 3.21
mobilnet1 16.04 7 0.97 29 4.25 ± 3.76
OneCore 47.94 6 5.03 84 12.71 ± 12.46
PuLSAR 48.52 118 8.51 85 15.72 ± 10.68
WasgaServer 48.52 708 8.51 85 15.62 ± 11.14
The path statistics in Table 3.7 exhibit similar average path lengths and critical
path costs for all the selected circuits. However, the median path cost is still quite
small, with respect to the cost of the critical path.
delay and the criticality of vertices. To identify the source of each hyperarc, we
consider the first vertex in the hyperarc description to be its source. Hence, we
propose to adapt the hygr file format to represent red-black hypergraphs.
In the file format used in TopoPart [193], both the target topology and the circuit
are described. Therefore, testing N circuits on T target topologies needs N × T
files. As a consequence, in our algorithms, we prefer to separate the description of
the circuit from that of the target topology, to avoid multiplying files.
To compare our algorithms with min-cut partitioning tools, we set a single
weight for the vertices, representing resource consumption (e.g., register width),
because the hgr file format can record only a single weight per vertex. In order
to deter min-cut partitioning tools from cutting critical hyperarcs, each hyperarc
is also weighted with the maximum criticality value of the vertices of the hyperarc
(see Section 4.2 about the r∗ weighting scheme). These values are computed using
the algorithms presented in the previous section.
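The export described above can be sketched as follows, assuming hMETIS-style hgr conventions (header line `|E| |V| fmt`, fmt code 11 for both hyperedge and vertex weights, 1-indexed vertices). The function name and data layout are illustrative, not the thesis's actual exporter.

```python
def write_hgr(path, hyperarcs, vertex_weight, criticality):
    """Export a hypergraph to an hMETIS-style hgr file with both
    hyperedge and vertex weights (fmt code 11).  Each hyperedge is
    weighted with the maximum criticality of its vertices, so that a
    min-cut partitioner is deterred from cutting critical hyperarcs.
    Vertices are numbered from 1, as hMETIS expects."""
    n = len(vertex_weight)
    with open(path, "w") as f:
        # header: number of hyperedges, number of vertices, format code
        f.write(f"{len(hyperarcs)} {n} 11\n")
        for arc in hyperarcs:
            w = max(criticality[v] for v in arc)
            f.write(" ".join(map(str, [w] + list(arc))) + "\n")
        # one vertex-weight line per vertex, e.g., resource consumption
        for v in range(1, n + 1):
            f.write(f"{vertex_weight[v]}\n")
```
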
Figure 3.1: Two target topologies T1 (a) and T2 (b) composed of 4 nodes.
Our third target topology, T3, is the complete 4-FPGA graph. It will be used
to evaluate critical path degradation, irrespective of additional routing costs due
to topology constraints. In this target topology, each FPGA is connected to each
other.
Figure 3.2: T4: the ICCAD 2019 contest target topology for problem B.
3.4 Conclusion
In the first section of this chapter, we presented our method for measuring the
quality of a partition for the problem of mapping a red-black hypergraph onto a
non-uniform topology. This measure is based on the computation of the longest
path within the DAHs that form the hypergraph, including cross-FPGA routing
cost. This problem is NP-hard in the general case, but, due to the acyclicity of
digital circuit combinatorial blocks, in our case, the computation can be performed
in polynomial time.
Since the red-black hypergraph partitioning problem plays an important role in
circuit partitioning, we need to evaluate different partitioning strategies on publicly
available benchmarks.
As the red-black hypergraph is a new model, defined in Chapter 1 of this
thesis, we presented in this chapter an adaptation of the hygr file format which
can encode a multi-valued red-black hypergraph. This file format is used for our
experiments.
To perform our experiments, we proposed six FPGA platforms, presented
in the final section. Three of these platforms comprise four FPGAs, and the other
three comprise eight FPGAs.
Chapter 4
Algorithms for coarsening
4.1 Clustering
In this section, we present some clustering approaches for graph, hypergraph,
and circuit partitioning. The algorithms associated with graph and hypergraph
partitioning are generally tailored to optimize cut minimization functions. In the
context of circuit partitioning, algorithms should consider additional constraints,
as well as an additional objective function aiming at minimizing the degradation
of the critical path.
4.1.4 Conclusion
In this section, we discussed the two clustering problems known as CA when
replication is allowed, and CN when replication is not allowed, both defined in Section 2.2.
In the rest of this chapter, we will only consider the CN problem.
Figure 4.1: Example of the use of replication to avoid cutting a critical DAH.
We show two partitions of a red-black hypergraph composed of seven DAHs, in
which the hatched DAH H3 contains the critical path. a) Partition that cuts H3
and H5. b) Partition where H2′ is a replication of H2, and DAHs H2′ and H5 are
cut; H2 is replicated as H2′ to avoid cutting the critical path in H3, so that H3
is fully contained in part π1.
In this section, we first review existing weighting schemes and show their limitations.
Then, we present our weighting scheme based on vertex criticality.
Delay propagation
Several previous works have proposed metrics for clustering, with the objective of
path minimization [4, 35]. For example, C. Ababei et al. [4] presented a weighting
scheme based on delay propagation to drive min-cut tools; the weight between two
vertices u and v is equal to the longest path from any red source vertex to vertices
u and v. This method calculates local weights along subpaths from red source
vertices to any vertex. Thus, within each DAH, H = (V, A):
l(u) = d(u) if Γ−(u) = ∅ ,
l(u) = d(u) + max_{v∈Γ−(u)} l(v) otherwise . (4.2)
For any vertex u ∈ V, the value l(u) corresponds to the maximum path cost
from any source vertex to u. Therefore, the maximum path cost within some DAH
will be found at the level of its sink vertices. However, a value computed on a
subpath does not indicate whether this subpath lies on the critical path, although
cutting anywhere along a path has the same detrimental effect of adding a penalty
to the total path cost. It is to alleviate these issues that the next metric has been devised.
Delay back-propagation
As all critical vertices must be labeled with the same weight, the delay propagation
scheme is not adequate. Hence, we have first devised a new weighting scheme based
on the back-propagation of path cost:
r(u) = l(u) if Γ+(u) = ∅ ,
r(u) = max_{v∈Γ+(u)} r(v) otherwise . (4.3)
For any u ∈ V , the value r(u) represents an upper bound for the path cost of the
longest red-red path traversing u. If u belongs to a path of maximum path cost,
then r(u) is equal to that path cost.
This weighting scheme accounts better for the overall impact of the cut along a
path because, unlike the previous method, the information is back-propagated to
all predecessors. However, it may include heavy vertices that do not belong to a
longest red-red path, as shown in Figure 4.2. To overcome this problem, we need
to define the value of the local critical path through each pair of vertices. For this
reason, we propose a third weighting scheme in the next subsection.
Figure 4.2: Example of the weighting schemes applied to the same graph.
r∗(u, v) = l(u) if u = v ,
r∗(u, v) = r(v) − (max_{u′∈Γ−(v)} l(u′) − l(u)) otherwise . (4.4)
In Equation 4.4, max_{u′∈Γ−(v)} l(u′) represents the value of the arcs along the local
critical path, i.e., the longest red-red path traversing v; arcs (u, v) with
l(u) < max_{u′∈Γ−(v)} l(u′) are not on this local critical path. It is a
more accurate metric for improving the behavior of clustering algorithms because,
in the context of circuit clustering, the aim is to group critical vertices together. If
the relationships between vertices correctly reflect criticality, then the clustering
algorithm can take advantage of this. An example of the computation of r∗ is
represented in Figure 4.3.
For each combinatorial sub-circuit modeled with a DAH, the r∗ vertex-vertex
criticality relation defines a criticality DAG. Every hyperarc in the DAH defines a
group of arcs in the criticality DAG, in which each arc connects the source vertex
to a sink vertex. An example is presented in Figure 4.4. The weight of each arc
corresponds to the r∗ value between its source and sink. Hence, the cut weight
of a hyperarc is the maximum of the r∗ values between its source and its sinks. We
will use the criticality DAG in the next section as support for proofs.
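The three weighting schemes of Equations 4.2-4.4 can be sketched on a plain DAG as follows. This is an illustrative sketch (the thesis applies the schemes inside each acyclic DAH of a red-black hypergraph); the dictionary-based graph representation is an assumption.

```python
def weighting_schemes(succ, d):
    """Compute the l, r and r* weighting schemes (Equations 4.2-4.4)
    on a DAG.  `succ` maps every vertex to its successor list, and
    `d` gives the traversal delay of each vertex."""
    pred = {v: [] for v in succ}
    for u in succ:
        for v in succ[u]:
            pred[v].append(u)
    # depth-first topological order (always exists: DAHs are acyclic)
    order, seen = [], set()
    def visit(u):
        if u not in seen:
            seen.add(u)
            for p in pred[u]:
                visit(p)
            order.append(u)
    for u in succ:
        visit(u)
    # Equation 4.2: forward propagation of delays
    l = {}
    for u in order:
        l[u] = d[u] + max((l[p] for p in pred[u]), default=0)
    # Equation 4.3: back-propagation of the sink values of l
    r = {}
    for u in reversed(order):
        r[u] = max((r[v] for v in succ[u]), default=l[u])
    # Equation 4.4: local criticality of each arc (u, v)
    r_star = {}
    for v in succ:
        if pred[v]:
            w = max(l[p] for p in pred[v])
            for u in pred[v]:
                r_star[(u, v)] = r[v] - (w - l[u])
    return l, r, r_star
```

On a small example with two sources of unequal delay feeding the same vertex, the arc leaving the lighter source receives a strictly smaller r∗ value than the arc leaving the heavier one, which is exactly the discrimination the l and r schemes lack.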
Figure 4.3: Example of the r∗ weighting scheme and how it is computed.
r∗(ui, x) and r∗(x, vi) are the values of the local critical path between the pairs
of vertices (ui, x) and (x, vi) in this subgraph. Let w be the maximum value of
the l(ui); for each ui, w − l(ui) represents the contribution of ui to the local
critical path value r(x) = max_{vi∈Γ+(x)} r∗(x, vi).
4.2.3 Conclusion
In this section, three weighting schemes have been defined and compared: l, r,
r∗ . As a cut anywhere along a path has the same detrimental effect of adding a
penalty to the total path cost, critical vertices must have the same criticality. The
r scheme back-propagates criticality values to the predecessors, compared to the l
scheme, which only propagates delay. Hence, in the l scheme, critical vertices do
not have the same criticality compared to the r scheme; that is, the r scheme is
better than the l scheme at identifying critical vertices. However, the r scheme can
back-propagate the critical value to non-critical vertices, while r∗ does not. Indeed,
the r∗ scheme computes each vertex’s local critical path value. In the context of
our circuit clustering problem, the objective is to avoid cuts along critical paths,
that is, to cluster critical vertices together. Consequently, the r∗ value appears
to be a better model to cluster critical
vertices than l and r. The DAH clustering algorithms presented in the following
sections use the r∗ weighting scheme. We also modeled the r∗ criticality relation
of a red-black hypergraph by a DAG. This criticality DAG will be used as a
support for proofs and explanations, representing the path structure of a DAH
together with its r∗ vertex-vertex criticality relationship.
Figure 4.5: In this example, the maximum number of arcs that intersect two
different paths is 2. If p is the grey path and p′ is the black dashed path, we see
that there are two arcs that satisfy the (u ∈ p∩p′, v ∈ p′\p) ∨ (u ∈ p′\p, v ∈ p∩p′)
condition of ι, see Equation 4.7. Hence, the size of the set is 2, i.e., ι(G) = 2.
Note that if |P(H)| > 1 and ∀p ∈ P(H), ι(p) ≤ 1, then the associated
criticality DAG G, in its undirected representation, does not have a K3 minor [159],
i.e., G is a tree. The condition |P(H)| > 1 is necessary to exclude the case of
the cycle graph Cn, which contains one red-red path with a single red vertex. This
also holds for directed trees. If ι(p) = 0, the associated weighted DAG is a path
graph, a stable graph, or a cycle graph. Two DAHs H and H′ are called ι-equivalent
iff max_{p∈P(H)} ι(p) = max_{p′∈P(H′)} ι(p′). Note that the set of paths in a DAH is the same as
in the corresponding criticality DAG. An example of the type of graph structure
as a function of ι is shown in Figure 4.6.
Proof. From Lemma 4.3.2, we show that a DAH and its corresponding criticality
DAG are ι-equivalent. We show in Lemmas 4.3.3 and 4.3.4 that it is possible to
build a polynomial algorithm by calling the procedure associated with Lemma 4.3.3
if |pmax| ≥ M (pmax cannot fit in one cluster); otherwise, the procedure associated
with Lemma 4.3.4 is called.
Proof. Since all ι(p) ≤ 1, and |P (G)| > 1, G does not have a K3 minor in an
undirected representation of G [159]. Let Algorithm A1 (displayed as Algorithm 3)
be the algorithm working as follows: Create clusters by successively traversing the
vertices in a predecessor-successor order. The next chosen successor is a vertex
which is not already in a cluster, and which is connected to the current vertex
with a heaviest arc. Successors that are not chosen are placed in a FIFO queue
in decreasing order of connecting arc weights. The first vertex to be chosen is a
source with the highest outbound arc weight, and the other sources are placed
in the queue by decreasing order of their highest outbound arc weight. When a
visited vertex has no successors or all of its successors are in a cluster, the next
vertex is taken from the queue and a new cluster is created. As long as the size
constraint M is respected, the visited vertices are placed in the cluster of the
current neighbor. When the current cluster is full, a new cluster is created. The
process ends when all vertices have been placed in a cluster.
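The steps above can be sketched as follows for a weighted DAG. This is a simplified illustration of Algorithm A1's greedy traversal, not the thesis's actual implementation (Algorithm 3); the data representation and tie-breaking details are assumptions.

```python
from collections import deque

def cluster_a1(succ, weight, M):
    """Sketch of Algorithm A1: greedy path-following clustering of a
    weighted DAG.  `succ` maps each vertex to its successors,
    `weight[(u, v)]` is the arc weight, and M bounds the cluster size."""
    pred_count = {v: 0 for v in succ}
    for u in succ:
        for v in succ[u]:
            pred_count[v] += 1
    sources = [v for v in succ if pred_count[v] == 0]
    # sources queued by decreasing heaviest outbound arc weight
    key = lambda u: max((weight[(u, v)] for v in succ[u]), default=0)
    queue = deque(sorted(sources, key=key, reverse=True))
    placed, clusters = set(), []
    while queue:
        u = queue.popleft()
        if u in placed:
            continue
        cluster = [u]              # new cluster from the queue
        placed.add(u)
        while True:
            cands = [v for v in succ[u] if v not in placed]
            cands.sort(key=lambda v: weight[(u, v)], reverse=True)
            if not cands:          # no free successor: take next from queue
                break
            queue.extend(cands[1:])  # unchosen successors, heaviest first
            u = cands[0]             # follow the heaviest arc
            placed.add(u)
            if len(cluster) < M:
                cluster.append(u)
            else:                    # cluster full: open a new one
                clusters.append(cluster)
                cluster = [u]
        clusters.append(cluster)
    return clusters
```

On a simple path of four vertices with M = 2, the sketch produces two clusters of two consecutive vertices each, i.e., the critical path is cut only once.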
Note that Algorithm A1 works in polynomial time. Let pmax be the longest
path in P (G). Let D ≥ 1 be the inter-cluster delay, and d ≥ 1 the intra-cluster
delay. Since M ≥ |pmax |, the vertices in the longest path can be grouped into the
same cluster. Possible paths p intersecting pmax can include a subset of vertices
placed into a different cluster. In the worst case, there exists a path p in a different
cluster from the pmax cluster, such that |p| = |pmax |, and p intersects pmax . This
case appears when there are enough longest paths p so that the sum of all of their
Figure 4.7: Example of cluster reduction for two neighbor paths pu and pv with
M = 4. Two paths are considered to be neighbors iff they share at least one
vertex. Algorithm A1 produces the clustering on the left of the figure, in which
(C(pu) + C(pv)) × M ≥ |pu| + |pv| + M, i.e., (3 + 3) × 4 = 24 ≥ 9 + 7 + 4 = 20.
The procedure merge_cluster transforms the clustering on the left of the figure
into the clustering on the right, and decreases by 1 the number of clusters along pv.
4.3.3 NP-Completeness
In this subsection, we extend the proof of NP-Completeness of the CN problem
to red-black hypergraphs. Z. Donovan presented a reduction of the integer set
partitioning problem to the CN problem for DAGs [59]. Based on this strategy,
we propose a similar reduction based on the structure of red-black hypergraphs.
Theorem 4.3.5. Let H be a DAH and G = (V, A′ ) be its corresponding criticality
DAG, such that ∀p ∈ P (H), ι(p) ≥ 2. Unless P = N P , there is no polynomial
algorithm to solve the problem CN <[1], M, ∆>.
Proof. Let us define and extend to red-black hypergraphs the definition of the
decision version of the CN problem previously defined by Z. Donovan, as CNdec :
The CNdec problem belongs to the NP class because it is possible to find the
critical path in a red-black hypergraph with a polynomial time algorithm [49].
Since there is no cycle within red-red paths, we can use an algorithm based on
topological sorting.
As presented in the work of Z. Donovan, we will use the same reduction strategy,
adapted to the red-black hypergraph structure, i.e., a reduction from the PARTITION
problem. Let PARTITION be the problem defined as follows: given a set of
integers I = {i1 , ..., in }, the goal is to find a partition of I into two subsets I1 ⊂ I
and I2 ⊂ I, such that:
∑_{i∈I1} i = ∑_{i∈I2} i . (4.8)
Figure 4.8: a) Example of a partition problem instance for the CNdec problem.
b) Example of a valid solution for the CNdec problem with d∗ = D, i.e., only one
cut per path is allowed.
Suppose we group the vertices of I1 with the red source vertex and those of I2
with the red sink vertex. In that case, we obtain two clusters of capacity 2 × B,
the maximum capacity associated with each cluster. Note that the value of the
critical path is equal to D = d∗ . Hence, X CN is also a valid instance of the CNdec
problem.
Now, we will demonstrate the converse, i.e., that if X CN is a valid instance for
the problem CNdec , then the analogous instance X P is valid for the PARTITION
problem. If X CN is a valid instance for the problem CNdec , then a partition of
vertices exists such that the critical path is equal to d∗ = D. Note that the red
vertices cannot form a cluster. Otherwise, the critical path would be cut twice and
its cost would equal 2 × D > d∗. Without loss of generality, let C1 be the cluster
with the source vertex and C2 be the cluster with the sink vertex, and W(C1) and
W(C2) be the sums of the weights of the vertices in clusters 1 and 2. We have:

W(C1) = W(C2) = 2 × B . (4.9)
Therefore:

∑_{v∈C1\{v R}} w(v) = ∑_{v∈C2\{v R}} w(v) = B = (∑_{i∈I} i) / 2 . (4.10)
The original version of the proof of Theorem 4.3.5, written for DAGs, can be found
in [59]. Indeed, ι(p) ≥ 2 is a necessary condition to construct the reduction from
the set partitioning problem.
4.3.4 Conclusion
In this section, the iota metric ι has been introduced. This metric models a con-
nectivity cost for paths. Paths are an essential aspect for a critical path clustering
problem. We showed that when paths are strongly interconnected, it is more
complicated to cluster critical paths. However, a polynomial-time algorithm is
introduced for the case ι ≤ 1. In practice, this algorithm cannot be applied to
circuits with ι > 1 without a pre-processing step modeling them as circuits with ι ≤ 1.
We showed again that the problem is NP-Complete for ι > 1, and adapted this
proof to our red-black hypergraph model. This result implies that circuits with
ι ≤ 1 cannot exactly model circuits with ι > 1.
In many cases, the propagation time of a circuit’s critical path is longer than the
time it takes to transfer a signal from one FPGA to another. However, there are
a few circuits for which this does not hold. Therefore, this proof applies only to
circuits that satisfy Equation 4.4.1.
The BSC algorithm groups vertices using a direct approach based on cut capacity, which makes it more practical than a recursive coupling approach. Let Solbsc(H) be the solution produced by our BSC algorithm, presented as Algorithm 5. It can be bounded by the worst solution, in which each vertex forms its own cluster and the critical path is therefore cut |pmax| − 1 times. Hence, we have:

$$Sol_{bsc}(H) \le (|p_{max}| - 1) \times D .$$
Let us calculate the approximation ratio for |pmax| > M and |pmax| ≤ M.

In the case when |pmax| > M:

$$\frac{Sol_{bsc}(H)}{Sol^*(H)} \le M \, \frac{(|p_{max}|-1) \times D}{(|p_{max}|-r) \times D + (M-1) \times |p_{max}| \times d - (M-r) \times d} .$$

By applying Equation 4.4.1, we obtain:

$$\frac{Sol_{bsc}(H)}{Sol^*(H)} \le M \, \frac{(|p_{max}|-1) \times D}{(|p_{max}|-r) \times D + (M-1) \times D - (M-r) \times d} = M \, \frac{(|p_{max}|-1) \times D}{(|p_{max}|-r+M-1) \times D - (M-r) \times d} .$$

Let us study the positivity of the expression DM − Dr − (M − r)d. Since D ≫ d, we have D(M − r) > (M − r)d, that is, DM − Dr > (M − r)d. Adding this positive quantity to the numerator can only increase the bound:

$$\frac{Sol_{bsc}(H)}{Sol^*(H)} \le M \, \frac{(|p_{max}|-1) \times D + DM - Dr - (M-r) \times d}{(|p_{max}|-r+M-1) \times D - (M-r) \times d} = M .$$

In the case when |pmax| ≤ M, we have:

$$\left\lceil \frac{|p_{max}|}{M} \right\rceil = 1 . \qquad (4.18)$$

Since D/d ≤ M, we obtain an approximation ratio bounded by M in this case as well.
Figure 4.9: This figure presents the effects of recursive matching vs. direct k-way clustering. On the left is a solution produced by a recursive matching algorithm for clustering with M = 3. On the right is the result of a direct clustering. This example shows that direct clustering produces fewer clusters, and hence possibly fewer cuts, than the recursive matching approach.
Let Sol∗(H) be the optimal solution for a vertex set clustering of H. In the best case, for a cluster size bounded by 2, the critical path will be coupled ⌈|pmax|/2⌉ times, which yields the following lower bound for Sol∗(H):

$$Sol^*(H) \ge \left( \left\lceil \frac{|p_{max}|}{2} \right\rceil - 1 \right) \times D + \left( |p_{max}| - 1 - \left( \left\lceil \frac{|p_{max}|}{2} \right\rceil - 1 \right) \right) \times d . \qquad (4.20)$$
Let SolHEM(H) be the solution produced by the HEM scheme on our proposed criticality DAG model. It can be bounded by the worst possible solution, in which every vertex forms its own cluster. Hence:

$$Sol_{HEM}(H) \le (|p_{max}| - 1) \times D .$$
Let us calculate the approximation ratio for the even and odd cases of |pmax|.

When |pmax| is even, ⌈|pmax|/2⌉ = |pmax|/2, and:

$$\frac{Sol_{HEM}(H)}{Sol^*(H)} \le \frac{(|p_{max}|-1)\times D}{\left(\left\lceil\frac{|p_{max}|}{2}\right\rceil-1\right)\times D+\left(|p_{max}|-1-\left(\left\lceil\frac{|p_{max}|}{2}\right\rceil-1\right)\right)\times d} = 2\times\frac{|p_{max}|D-D}{|p_{max}|D-2D+|p_{max}|d+2d} .$$

By applying Equation 4.19, we obtain:

$$\frac{Sol_{HEM}(H)}{Sol^*(H)} \le 2\times\frac{|p_{max}|D-2D+|p_{max}|d+2d-D}{|p_{max}|D-2D+|p_{max}|d+2d} + \frac{2D}{|p_{max}|D-2D+|p_{max}|d+2d} ,$$

hence:

$$\frac{Sol_{HEM}(H)}{Sol^*(H)} \le 2 .$$
When |pmax| is odd, ⌈|pmax|/2⌉ = (|pmax|+1)/2, and:

$$\frac{Sol_{HEM}(H)}{Sol^*(H)} \le \frac{(|p_{max}|-1)\times D}{\left(\frac{|p_{max}|+1}{2}-1\right)\times D+\left(|p_{max}|-1-\left(\frac{|p_{max}|+1}{2}-1\right)\right)\times d}$$
$$= \frac{(|p_{max}|-1)\times D}{\left(\frac{|p_{max}|}{2}-\frac{1}{2}\right)\times D+\left(\frac{|p_{max}|}{2}-\frac{1}{2}\right)\times d} = 2\times\frac{|p_{max}|D-D}{|p_{max}|D-D+|p_{max}|d-d} .$$

Note that the critical path contains at least 2 vertices, so that |pmax| ≥ 2 and |pmax|d − d ≥ 0. Hence:

$$\frac{Sol_{HEM}(H)}{Sol^*(H)} \le 2\times\frac{|p_{max}|D-D+|p_{max}|d-d}{|p_{max}|D-D+|p_{max}|d-d} = 2 ,$$

that is, SolHEM(H)/Sol∗(H) ≤ 2.
Hence, the HEM algorithm, applied to our local critical path model represented
by the corresponding DAG, has an approximation ratio of 2 for the CN <[1], 2, ∆>
problem.
For M = 2, the matching algorithm behaves in the same way as the BSC algorithm with our r∗ weighting; hence, they have the same approximation ratio for M = 2.
4.4.3 Conclusion
In this section, a parameterized approximation algorithm was presented. Its parameter is M, the cluster capacity. We proved that BSC, presented as Algorithm 5, has an approximation ratio of M. We also proved that HEM, used with our r∗ weighting scheme, has an approximation ratio of 2.
The BSC Algorithm 5 improves on existing parameterized approximation ratios for clustering with path-length minimization.
Figure 4.11: Comparison between the number of clusters produced by BSC and HEM on a subset of the largest circuits, for M values ranging from 2 to 4096. The subset is composed of Chipyard, TITAN and b14-22, all instances with |V| > 10000. Each solid line corresponds to the number of clusters produced by HEM, and each dashed line to the number of clusters produced by BSC. Results show that BSC produces fewer clusters than HEM. The subfigure is a zoom on b14, b20, b21, and b22, for M values ranging from 2 to 32.
HEM algorithm. This can be explained by the fact that BSC directly groups sets of
related vertices and applies a final clustering step that tends to reduce the number
of clusters. In contrast, the HEM algorithm recursively groups vertices in pairs,
which can more easily lead to situations where there are several adjacent clusters
of size M/2 + 1 that cannot be merged.
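The blocking situation described above, where pairwise merging leaves adjacent clusters of size M/2 + 1 that can no longer be merged, can be reproduced on a toy chain of unit-weight vertices. This sketch is purely illustrative (hypothetical chain graph and capacities, not the thesis implementation):

```python
def recursive_matching(n, M):
    """Pairwise (HEM-style) coarsening on a chain of n unit-weight
    vertices: each pass merges neighboring clusters two by two, as
    long as the merged size stays within the capacity M."""
    sizes = [1] * n
    while True:
        merged, i, changed = [], 0, False
        while i < len(sizes):
            if i + 1 < len(sizes) and sizes[i] + sizes[i + 1] <= M:
                merged.append(sizes[i] + sizes[i + 1])
                i += 2
                changed = True
            else:
                merged.append(sizes[i])
                i += 1
        sizes = merged
        if not changed:
            return sizes

def direct_clustering(n, M):
    """Direct clustering: walk the chain once and fill each cluster
    up to the capacity M before opening the next one."""
    full, rest = divmod(n, M)
    return [M] * full + ([rest] if rest else [])

# With M = 6, pair merging gets stuck on clusters of size 4 = M/2 + 1:
# two adjacent size-4 clusters cannot be merged (8 > 6).
print(recursive_matching(16, 6))  # [4, 4, 4, 4]
print(direct_clustering(16, 6))   # [6, 6, 4]
```

On this chain, direct clustering reaches 3 clusters while pairwise matching stalls at 4, matching the behavior observed in Figure 4.11.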
4.6 Conclusion
In this chapter, we studied the combinatorial circuit clustering problem for delay
minimization (CN). We presented a brief state of the art in Section 4.1. The
central aspect of clustering algorithms is to select vertices to merge. Thus, the key
is to establish a good attractiveness metric between the vertices that best suits
the objective. In Section 4.2 we presented the r∗ weighting scheme, specifically
designed for this purpose.
Since the problem is NP-hard in the general case, polynomial-time algorithms have been presented in Section 4.3 for a specific class of red-black hypergraphs whose criticality DAG representation is a tree. A new M-approximation algorithm that runs in O(m · log2(m)) time, with M being the maximum size of clusters and m the number of hyperarcs, was introduced in Section 4.4.
Section 4.5 shows a comparison between the classic HEM algorithm, improved with our weighting scheme r∗, and our BSC algorithm, which also uses r∗. Experimental results show that BSC improves delay by 20% to 50% on average over all circuits, for many cluster sizes.
Chapter 5
Initial partitioning
Figure 5.1: In this example, two connected DAHs are represented. DAH1 and DAH2 are connected since they share red vertices. Each striped vertex is critical. Specifically, the red vertex shared by DAH1 and DAH2 is critical in both DAHs. To avoid cutting this vertex, the exploration should continue into DAH2 even if not all vertices in DAH1 have been visited. Finally, this example highlights the benefit of continuing the exploration through each DAH when a critical path shares a red vertex between two DAHs.
grouped in the same part, if the capacity constraint allows for it. More details
on this approach are given in Subsection 5.2.1. Critical paths sometimes share a
red vertex, which constitutes a sink in one DAH and a source in another DAH,
as shown in Figure 5.1. In this case, continuing exploration beyond the currently
explored DAH is advantageous. For this reason, we have investigated an approach
using a dedicated depth-first search algorithm. More details on this approach are
provided in Subsection 5.2.2.
priority queues store vertices according to their criticality value, inserting the most critical vertices first. When a vertex v is processed, r(v) is the maximum criticality over all vertices in the queues. The queues are empty at the beginning of the algorithm, and every insertion is performed in O(log2(|V|)) time by using a heap data structure.
Lemma 5.2.1. Let p ∈ P R be a red-red path, and a path order <p . For all u, v ∈ p,
if u <p v, then r(u) ≥ r(v).
Theorem 5.2.1. Let H = (V, A) be a DAH, and ov, a visit order computed by
DBFS. According to the derived breadth-first-search, it is not possible to have a
pattern va , vb , vc , with va , vb , vc ∈ V B , such that r(va ) > r(vb ) and r(vc ) ≥ r(va ).
Proof. The traversal starts from at least one source of the DAH H, so ∃vR ∈ V R
such that vR <ov va . According to Lemma 5.2.1, we have r(vR ) ≥ r(va ). Let us
assume a pattern va , vb , vc , with va , vb , vc ∈ V B , r(va ) > r(vb ) and r(vc ) ≥ r(va ).
As the vertices are black vertices of the same DAH, each of them will be inserted
in the secondary priority queue according to their criticality value, which is a
contradiction.
Theorem 5.2.1 states that the DBFS algorithm allows one to perform a walk
following the local topological order by selecting, at each step, a neighbor of max-
imum criticality. This choice allows one to favor the grouping, within the same
part, of neighboring vertices with high criticality. As long as the size constraint is
respected, every selected vertex will be placed in the same part. An example can
be found in Figure 5.2.
Proof. The algorithm performs a breadth-first search, such that vertices are visited
only once. The is_visited array ensures the following invariant: when some
vertex is visited, it is marked with the value True and is no longer processed,
as indicated by the condition at line 25. However, the algorithm has two while loops, at lines 11 and 14, which depend on two queues. The array is_visited ensures that the vertices are only processed once, and together these two loops run in O(|V|) time. In the second while loop, there is a for loop at line
23. This for loop iterates over the hyperarcs of which v is a source. We assume
that each hyperarc is visited only once because there is exactly one source per
hyperarc. Every non-visited vertex in the current hyperarc is inserted in a priority
queue encoded by a heap data structure. The time complexity for each insertion is
in O(log2 (|V |)). Each vertex is inserted and processed only once in the while loop
at line 11. Moreover, to explore the neighborhood of each vertex, each hyperarc
containing this vertex is visited. Let Λ = max{|a|, ∀a ∈ A} be the maximum
size of hyperarcs; then, the total time complexity is O(|V |log2 (|V |) + Λ|A|), with
the term |V |log2 (|V |) corresponding to the processing and insertion of vertices,
and the term Λ|A| corresponding to neighborhood exploration. Note that the Derived-Breadth-First algorithm has an additional factor of log2(|V|) compared to the complexity of the breadth-first search algorithm. This additional factor corresponds to the management of the priority queue that characterizes DBFS, which aims to process critical vertices first.
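The core idea of DBFS, replacing the FIFO queue of a breadth-first search with a criticality-ordered priority queue, can be sketched on a plain DAG. This is a simplified illustration only: it collapses the two-queue structure and the red/black distinction into a single heap, and the graph and criticality values are hypothetical.

```python
import heapq

def dbfs_order(succ, criticality, sources):
    """Simplified sketch of a criticality-driven breadth-first walk:
    a max-heap (simulated by negating keys, since heapq is a min-heap)
    replaces the FIFO queue, so the most critical unvisited neighbor
    is always dequeued first."""
    heap = [(-criticality[v], v) for v in sources]
    heapq.heapify(heap)
    visited, order = set(sources), []
    while heap:
        _, v = heapq.heappop(heap)
        order.append(v)
        for w in succ.get(v, ()):
            if w not in visited:
                visited.add(w)
                heapq.heappush(heap, (-criticality[w], w))
    return order

# Toy DAG: s -> a, b ; a -> t ; b -> t, with a lying on the more
# critical path (hypothetical criticality values).
succ = {"s": ["a", "b"], "a": ["t"], "b": ["t"]}
crit = {"s": 3, "a": 3, "b": 1, "t": 3}
print(dbfs_order(succ, crit, ["s"]))  # ['s', 'a', 't', 'b']
```

The walk follows the critical path s, a, t before visiting the low-criticality vertex b, which is what favors grouping critical neighbors into the same part.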
“[...] Note also that our particular SCMP ordering is designed with
respect to the stated problem formulation. For specific multi-FPGA
system designs, other objectives may prevail [...].”
Figure 5.2: Examples of specific cut penalties for the DBFS and DDFS algorithms. Penalties are represented inside boxes. All hatched nodes will be placed in the same part. Each traversal starts from the “root” vertex. DBFS avoids multiple cuts along paths within a DAH, while DDFS does not (see the frame in the DDFS example). However, if a critical path starts in DAH2 from a sink of DAH1, DBFS can produce a cut along this critical path, while DDFS cannot (see the frame in the DBFS example).
a local critical path. A local critical path is a path with a maximal criticality
value in a sub-hypergraph. According to Algorithm 7, each vertex of p′ is inserted
into the priority queue. However, the insertion is driven by vertex criticality. If
∃v ′′ such that r(v ′′ ) > r(v ′ ) then, ∃p′′ ̸= p′ with v ′′ ∈ p′′ such that p′′ is a local
maximum and p′ is not a local maximum. That is a contradiction.
⇐= Suppose that there exists a path p′ which is a local critical path. As v′ is the first black vertex in p′, r(v′) is equal to the local maximal criticality; that is, there exists no v′′ such that r(v′′) > r(v′), which contradicts the assumption that r(v′′) > r(v′).
As the derived depth-first search visits the vertices with higher criticality first,
v ′ will be visited after vR . That is, a path {v, vR , v ′′ , . . . , v ′ } with r(v ′′ ) > r(v ′ )
cannot exist.
Proof. The algorithm performs a depth-first search, hence vertices are visited only
once. The is_visited array ensures the following invariant: if some vertex is
visited, it is flagged as True and is no longer processed, as prescribed by the
condition at line 22. However, the algorithm has a while loop at line 11, which depends on p_queue. The while loop contains a for loop at line 20. This for loop
iterates over the hyperarcs of which v is a source. We assume that each hyperarc
is visited only once, because there is exactly one source per hyperarc. Each non-
visited vertex in the hyperarcs is inserted in a priority queue, encoded by a heap
data structure. The complexity for each insertion is in O(log2 (|V |)) time.
Each vertex is inserted and processed only once in the while loop at line 11.
Moreover, to explore the neighborhood of each vertex, each hyperarc is visited.
Let Λ = max{|a|, ∀a ∈ A} be the maximum size of hyperarcs; then, the to-
tal time complexity is O(|V |log2 (|V |) + Λ|A|), with |V |log2 (|V |) corresponding to
the processing and insertion of vertices, and Λ|A| corresponding to neighborhood
exploration.
Note that the Derived-Depth-First algorithm has an additional factor of log2(|V|) compared to the complexity of the depth-first search algorithm. This additional factor corresponds to the management of the priority queue that characterizes DDFS, which aims to process critical vertices first.
Figure 5.3: Example of a sub-DAH in which the sources are at the top, and the
sinks, at the bottom. This example displays two cone components. A cone con-
sists of a sink and all the vertices that can be reached from that sink in the
reversed hypergraph, i.e., the hypergraph whose arcs are reversed. In other words,
a cone contains all paths that end in the sink of the cone. Note that a vertex which is not a sink can appear in several cones, as is the case for the vertices in the middle of this figure.
$$cost(C_i, C_j) = \frac{|C_i \setminus V^R| + |C_j \setminus V^R|}{|\omega(C_i)| + |\omega(C_j)| - 2 \times |\omega(C_i) \cap \omega(C_j)| + 1} . \qquad (5.1)$$
Note that in the original formula in [171], the denominator is set to: |ω(Ci )| +
|ω(Cj )| − 2 × |ω(Ci ) ∩ ω(Cj )|. Hence, if there are only two sinks, i.e., two cones, in
the graph, the formula cannot be computed because the denominator would be
equal to 0. It is for this reason that we changed the denominator to |ω(Ci )| +
|ω(Cj )|−2×|ω(Ci )∩ω(Cj )|+1. This cost is used in [171] to measure the attraction
between two components during a cone merging process. Indeed, as we have shown
before, cones can share arcs. Hence, unmerged cones which share arcs result in
cut arcs and thus cut paths. The selection of the cones to merge must therefore be
optimized to satisfy the partitioning objective function. The authors also suggest
grouping components that share a critical path if the capacity constraint allows
for it. If a cone component is too large, the authors suggest partitioning the cone
using the minimum cut metric.
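Cone extraction by reverse reachability, together with the merge cost of Equation 5.1, can be sketched as follows. The graph, its arc set, and the absence of red vertices are all hypothetical toy data, not the thesis implementation:

```python
def cone(pred, sink):
    """Cone of a sink: the sink plus every vertex that reaches it,
    found by a reverse traversal (i.e., reachability from the sink
    in the reversed graph)."""
    stack, seen = [sink], {sink}
    while stack:
        v = stack.pop()
        for u in pred.get(v, ()):
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return seen

def merge_cost(ci, cj, arcs, red):
    """Attraction between two cone components, following Eq. (5.1):
    omega(C) is the set of arcs touching C; the +1 in the denominator
    avoids a division by zero when both cones span the same arcs."""
    def omega(c):
        return {a for a, (u, v) in arcs.items() if u in c or v in c}
    wi, wj = omega(ci), omega(cj)
    num = len(ci - red) + len(cj - red)
    return num / (len(wi) + len(wj) - 2 * len(wi & wj) + 1)

# Toy DAG: a -> c, b -> c, c -> s1, c -> s2 (two sinks, two cones).
pred = {"c": ["a", "b"], "s1": ["c"], "s2": ["c"]}
arcs = {0: ("a", "c"), 1: ("b", "c"), 2: ("c", "s1"), 3: ("c", "s2")}
c1, c2 = cone(pred, "s1"), cone(pred, "s2")
print(sorted(c1))                        # ['a', 'b', 'c', 's1']
print(merge_cost(c1, c2, arcs, set()))   # 8.0
```

Here the two cones share almost all arcs, so the denominator collapses to the +1 safeguard and the attraction is maximal: these are exactly the cones that should be merged to avoid cutting shared paths.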
Since, in this thesis, we are interested in the objective function fp , we propose
to extend the definition of cone connected component to that of critical connected
components, according to the vertex criticality value r(v) defined in Section 1.2. A
critical cone is a cone whose size is fixed and whose vertices maximize criticality.
Hence, let c be a cone and c′ a critical cone, both originating from the same sink
v. We have the following properties:
$$c' \subset c , \qquad \forall v \in c,\ \forall v' \in c',\ r(v) \le r(v') .$$
These properties ensure that, when a critical cone is computed, the critical
cone contains the vertices of the cone of highest criticality, thus maximizing the
grouping of critical paths.
|V| [48]. We can therefore assume that the amortized cost of a union operation is constant. Considering the complexity of the union, the final time complexity of the algorithm is in O(|V| + α(|V|)|V|²) = O(|V|²).
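The union operation discussed above can be realized with a standard disjoint-set forest. This is a generic textbook sketch, not the thesis code; with union by size and path compression, each operation costs O(α(n)) amortized, effectively constant:

```python
class UnionFind:
    """Disjoint-set forest with path compression and union by size;
    amortized cost per operation is O(alpha(n)), effectively constant."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:      # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False
        if self.size[rx] < self.size[ry]:  # union by size
            rx, ry = ry, rx
        self.parent[ry] = rx
        self.size[rx] += self.size[ry]
        return True

uf = UnionFind(5)
uf.union(0, 1); uf.union(1, 2)
print(uf.find(2) == uf.find(0))  # True
print(uf.find(3) == uf.find(0))  # False
```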
5.2.4 Conclusion
In this section, traversal algorithms have been adapted to the problem of partitioning with path cost minimization. Each of these algorithms has its own characteristics. DBFS explores the red-black hypergraph across DAHs, which is interesting for circuits made of large, highly connected DAHs. However, for very tightly connected circuits, DBFS is less interesting than DDFS, which explores the hypergraph in depth. In addition, DDFS traverses DAHs, making it effective when the critical paths of multiple DAHs share a red vertex.
Finally, we proposed an extension of the cone partitioning algorithm that takes advantage of the criticality functions defined in the previous chapter. The structure of cones provides path grouping properties that are of interest for optimizing the fp function. In addition, this algorithm locally groups critical components, making it more adaptable to different hypergraphs than the more specific DBFS and DDFS algorithms. However, CCP does not produce the exact expected number of parts. To obtain the expected number of parts, CCP must be coupled with a partitioning algorithm.
5.3.1 Model
The objective of the integer programming model is to minimize the degradation
of the critical path. Therefore, one needs to compute the maximum degradation
among all possible degradations. One also needs to model the target topology, to
Figure 5.4: In this example, the red-red path l in a red-black hypergraph is defined as an ordered sequence of vertices. A set of tasks Ol is created from the path l. The scheduling constraint models the total completion time zl of the task set Ol, i.e., $\sum_{i \in O_l} d_i$, with an additional routing cost. This routing cost models the impact of the partitioning/mapping solution on the completion time of path l.
consider the potentially different delays between parts. Existing cut minimization
tools do not address these two aspects of path length and target topology, since
these tools only aim at reducing the connections between parts. As this objective
is still essential in practice, we add a secondary objective to our model: minimizing
the connectivity minus one.
Also, since the paths between two red vertices do not contain any cycle, it is
possible to see the chain of black vertices in a path as a sequence of operations/tasks
i associated with some job l. Consequently, we consider scheduling constraints in
our model to minimize the impact of partitioning on the critical path. GivenP a path
(job) p = {v0 , v1 , v2 }, the critical time associated with the path equals v∈p dv . If
vertices (tasks) belonging to p are placed in different parts, then a time penalty
must be added to the total time of p. An example can be found in Figure 5.4.
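The per-job cost just described, vertex delays plus a routing penalty for each inter-part hop, can be computed directly. The delays, part assignment, and inter-part delay matrix below are hypothetical toy values:

```python
def path_completion_time(path, delay, part, Dk):
    """Completion time of a path (job): the sum of its vertex
    propagation delays plus a routing penalty Dk[k][k'] each time two
    consecutive vertices land in different parts."""
    total = sum(delay[v] for v in path)
    for u, v in zip(path, path[1:]):
        ku, kv = part[u], part[v]
        if ku != kv:
            total += Dk[ku][kv]
    return total

# Toy job p = {v0, v1, v2}: hypothetical delays and a 2-part mapping
# with an inter-part delay of 10 time units.
delay = {"v0": 2, "v1": 3, "v2": 1}
part = {"v0": 0, "v1": 0, "v2": 1}
Dk = [[0, 10], [10, 0]]
print(path_completion_time(["v0", "v1", "v2"], delay, part, Dk))  # 16
```

The cut between v1 and v2 adds the 10-unit routing penalty to the intrinsic 6 units of propagation time, which is exactly the degradation the scheduling constraints of the model penalize.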
Specifying formally our problem requires a lot of definitions, which are given
in the following tables: sets can be found in Table 5.1, parameters in Table 5.2,
and variables in Table 5.3.
Set Definition
V Set of vertices.
E Set of hyperedges.
J Set of jobs.
Ol Ordered set of operations of job l, (i ∈ Ol ), where Ol1 and Oln
are the first and the last elements of Ol .
i,i′ Vertices/operation index (i, i′ ∈ V).
j, j ′ Hyperedges index (j, j ′ ∈ E).
l Job index (l ∈ J).
k Part index.
Table 5.1: Definitions of indices and sets for specifying our integer programming
problem.
Parameter Definition
n Number of vertices.
m Number of hyperedges.
hij 1 if vertex i is connected to hyperedge j, 0 otherwise.
ckr Capacity of part k for resource r.
qir Quantity of resource r, required by i.
di Propagation time of vertices (operation) i.
Dk,k′ Delay between part k and k ′ .
Wv Vertex weight.
Wa Hyperedge weight.
Table 5.2: Definitions of parameters for specifying our integer programming prob-
lem.
The specification of our integer program, which is presented below, aims at ful-
filling two objectives: 5.2a for critical path minimization, and 5.2b for connectivity
Variable Definition
xik 1 iff vertex i is mapped onto part k, 0 otherwise.
yjk 1 iff hyperedge j has a vertex placed on part k.
zl Completion time of job l.
zmax Maximum completion time of jobs.
Table 5.3: Definitions of variables for specifying our integer programming problem.
cost minimization:

$$\min\ z_{max} , \qquad (5.2a)$$
$$\min\ \sum_{j} \Big( \sum_{k} y_{jk} - 1 \Big) , \qquad (5.2b)$$

subject to: (5.2c)

$$\sum_{k} x_{ik} = 1, \quad \forall i , \qquad (5.2d)$$
$$h_{ij} x_{ik} \le y_{jk}, \quad \forall i, j, k , \qquad (5.2e)$$
$$\sum_{i} q_{ir} x_{ik} \le c_{kr}, \quad \forall k, r , \qquad (5.2f)$$
$$\sum_{i \in O_l} d_i + \sum_{i, i' \in O_l} x_{ik} x_{i'k'} D_{kk'} \le z_l, \quad \forall k, k', l , \qquad (5.2g)$$
$$z_l \le z_{max}, \quad \forall l , \qquad (5.2h)$$
$$x_{ik}, y_{jk} \in \{0, 1\}, \quad z_l \in \mathbb{N}, \quad \forall i, j, k, l . \qquad (5.2i)$$
Constraint 5.2d states that each vertex is mapped onto exactly one part. Constraint 5.2e guarantees that yjk equals the connectivity cost associated with hyperedge j. Constraint 5.2f ensures that the capacity constraint is respected. Constraints 5.2g and 5.2h determine the value of the delay of each job (path) and the maximum delay (critical path). Constraint 5.2i enforces the non-negativity and integrality conditions on the variables.
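The semantics of the model, feasible assignments under the capacity constraint, with z_max the largest job completion time including inter-part delays, can be checked by exhaustive enumeration on a toy instance. This sketch is not an ILP solver; the instance data (delays, jobs, delay matrix, unit resource demands) are hypothetical:

```python
from itertools import product

def solve_by_enumeration(n_parts, delay, jobs, Dk, capacity):
    """Brute-force check of the model's semantics on a toy instance:
    enumerate all part assignments, keep those satisfying the capacity
    constraint, and minimize z_max, the maximum job completion time
    (each inter-part hop along a job adds the corresponding Dk[k][k'])."""
    vertices = sorted(delay)
    best = None
    for assign in product(range(n_parts), repeat=len(vertices)):
        part = dict(zip(vertices, assign))
        loads = [0] * n_parts
        for v in vertices:
            loads[part[v]] += 1          # unit resource demand per vertex
        if any(l > capacity for l in loads):
            continue                     # violates constraint (5.2f)
        z_max = 0
        for job in jobs:                 # constraints (5.2g)-(5.2h)
            z = sum(delay[v] for v in job)
            for u, v in zip(job, job[1:]):
                if part[u] != part[v]:
                    z += Dk[part[u]][part[v]]
            z_max = max(z_max, z)
        if best is None or z_max < best:
            best = z_max
    return best

delay = {"a": 1, "b": 2, "c": 1, "d": 2}
jobs = [["a", "b", "c"], ["b", "d"]]
Dk = [[0, 5], [5, 0]]
print(solve_by_enumeration(2, delay, jobs, Dk, capacity=3))  # 9
```

With capacity 3, some vertex must be exiled; isolating d cuts only the second job once, giving the optimum z_max = 4 + 5 = 9.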
5.3.2 Symmetries
There are symmetries in the solution space of hypergraph partitioning for cut size
minimization. Indeed, in the plain partitioning case, if ω hyperedges span across
parts, ω remains unchanged regardless of the labels of the parts. However, in our
124 J. Rodriguez
5. Initial partitioning
problem, we are trying to minimize the path cost, which is degraded by routing
paths across parts that are not always fully connected. An important point in
modeling the partitioning problem is to break the symmetries. For example, P.
Bonami et al. [25] propose models for the partitioning problem without symmetries.
However, these constraints are too restrictive for the solution space associated
with path cost, because, in our problem, the penalty between two parts is not
homogeneous. Indeed, the target topology defines a time penalty associated with
path routing. From a routing point of view, we cannot consider all partitions
with the same subset of vertices but different labels, as a symmetry, an example
is provided in Figure 5.5. Note that some symmetries still exist. For example, if
we take the partition a shown in Figure 5.5, it is possible to create a partition a′
by swapping the vertices of π0 and π3 and those of π1 and π2 .
the instance size. During the coarsening phase, the structure of the red-black
hypergraph can change, i.e., extra connections between vertices can be created.
As a result, the coarsened red-black hypergraph may have a modified critical path
that does not exist in the original red-black hypergraph.
Figure 5.6 presents such an example, in which the value of the critical path in the coarsened red-black hypergraph is at least equal to the critical path in the
original hypergraph. In other words, the criticality of the paths in the coarsened
red-black hypergraph is an upper bound of that in the initial hypergraph.
However, the new paths may be highly approximate and may disrupt the initial partitioning step, because the integer program optimizes job sequences that are biased by coarsening. There are several ways to overcome this problem. The first one is to create a new data structure based on the routing table.
This solution creates a model with a much larger memory footprint. Another so-
lution is to calculate a degradation tolerance factor for paths during the clustering
phase. The tolerance factor may be used to model the additional cost of creating
new paths. It might be interesting to experiment and measure the effectiveness
of these two strategies in combination with our integer program as the first par-
titioning step. Maintaining the correctness of the path information during the
initial partitioning would allow one to take advantage of an exact solver of a linear
programming model.
5.3.4 Conclusion
In this section, we introduced an integer programming model, integrated into an
exact approach, for solving concurrently the two problems of cut and path length
minimization. Since circuits can be large, a coarsening step must be performed
upstream to reduce the number of variables in the model. However, we have
shown that the coarsening step can create erroneous paths when the vertices are
merged. We have proposed strategies that may overcome this issue, but path
modeling remains complex. Because digital electronic circuits can be very large,
path information can be heavily degraded by the coarsening process. As a result, the initial solution produced by the linear program is likely to be approximate, which somewhat contradicts the purpose of this tool.
and signal capacities between FPGAs. Their algorithm swaps parts over the FPGA platform to optimize placement.
It is possible to compute an “exact” solution for placing a partitioned hypergraph. First, the partitioned hypergraph is viewed as a graph in which each vertex corresponds to a part. Then, for example, the integer programming model can be re-used to move parts on the target topology, according to timing objectives. Finally, if the number of parts is too large, a heuristic based on the approach presented by S. Liou et al., or a refinement algorithm based on swapping parts, may prove effective. The refinement algorithms then take over to optimize the placement at each level.
the path information becomes less accurate. Therefore, for this method to be
applicable to larger circuits, the coarsening step needs to be adjusted.
5.5.2 Results for DBFS, DDFS and CCP with min-cut tools
In this subsection, we present results on the ITC, Chipyard and Titan circuit benchmarks, targeting the 6 topologies described in Chapter 3. We compare DBFS, DDFS and CCP with the min-cut tools hMetis, khMetis, PaToH, KaHyPar, and TopoPart.
For our experiments, we set the hMetis and khMetis balance parameter to 5% and the number of runs to 10. For coarsening, we chose the heavy edge strategies to take advantage of our weighting based on vertex criticality. We chose to use PaToH in its default version to see whether it could produce acceptable solutions in reasonable time, because computation time is a critical aspect of industrial-size circuit prototyping. We set the vertex visit order (VO) to maximum net size sorted mode and the matching type (MT) to heavy connectivity clustering. This vertex visit order favors the processing of the largest hyperarcs first. In addition, we set the partitioning algorithm (PA) to greedy hypergraph growing.

The relative critical path degradation is defined as:

$$\tau(H^\Pi) = \frac{d^\Pi_{max}(H) - d_{max}(H)}{d_{max}(H)} . \qquad (5.3)$$
Each figure presents critical path degradation relative to the best degradation produced. τ(H^Π) is equal to 0 when d^Π_max(H) = d_max(H), and τ(H^Π) is equal to 1 when d^Π_max(H) = 2 × d_max(H). Therefore, τ(H^Π) + 1 is the multiplicative coefficient applied to the value of the critical path d_max(H), that is:

$$d^\Pi_{max}(H) = (\tau(H^\Pi) + 1) \times d_{max}(H) . \qquad (5.4)$$
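The degradation metric of Equations 5.3 and 5.4 amounts to a one-line computation; the delay values in this sketch are hypothetical:

```python
def tau(d_max_part, d_max):
    """Relative critical path degradation of Eq. (5.3): 0 means the
    partitioned critical path equals the original d_max, 1 means it
    doubled."""
    return (d_max_part - d_max) / d_max

# Hypothetical delays: original critical path of 40 ns, 100 ns after
# partitioning and routing.
t = tau(100.0, 40.0)
print(t)               # 1.5
print((t + 1) * 40.0)  # 100.0, recovering Eq. (5.4)
```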
Each critical path degradation is sorted from best to worst; that is, each curve shows on how many instances the algorithm performs well and on how many it performs poorly.
We decided to sort the relative degradations for each algorithm to show how small or how large the critical path degradation produced by these algorithms can be. In addition, the sorted degradations define a curve that represents the algorithm's performance over the entire benchmark. For example, to determine whether an algorithm produces less degradation than the others, we need to look at the position of its plot relative to the others. Thus, the lower the plot associated with an algorithm lies with respect to the other plots, the better that algorithm performs. As a result, we can observe the increasing degradation over the whole benchmark for each target topology. Results on the fully connected topologies T3 and T6, K4 and K8, are presented in Figures 5.8 and 5.13. However, these plots do not allow one to easily compare the result of an algorithm on one specific circuit with its result on another; detailed results are available in Appendix A.2.
Figures 5.9 and 5.10 allow one to compare the critical path degradation of DBFS, DDFS and CCP with respect to the min-cut partitioners, for a 4-partitioning. For these two topologies, DDFS produces fairly high degradation. However, if we look at Figure 5.8, which presents results for a fully connected target topology, DDFS produces better results, as it does not perform worse than KaHyPar, PaToH, and DBFS. We can conclude that the poor results on topologies T1 and T2 are due to routing penalties induced by the notion of distance in the model.
Figure 5.8 allows us to estimate routing-related degradation for other topologies. Remember that none of these algorithms takes the target topology into account. The results of Figure 5.8 show that CCP produces the least worst-case degradation compared to the other algorithms, while this is not the case for topologies T1 and T2. We can conclude that CCP groups critical paths into cones and then merges them to limit the cut of critical or semi-critical paths. However, the degradation suffered on other topologies indicates that the resulting partition is less robust to routing than other partitions. This problem arises from the order in which the parts are assigned, from 0 to k − 1. Indeed, this order does not depend on the target topology. Note that DBFS and DDFS have the same part assignment process.
DBFS and DDFS provide different results for some circuits: for example, for circuit b05, DBFS produces a degradation close to 1 while DDFS produces a degradation close to 3; but, for the mnist circuit, DDFS produces a degradation close to 4, which is better than the one produced by DBFS, close to 6. These differences highlight the importance of the choice between DBFS and DDFS for circuit partitioning. Note that for b05, CCP produces a degradation close to 2, which is between DBFS and DDFS. Hence, when circuit properties are unknown, one can compute partitions with DBFS, DDFS and CCP, and select the best one.
Comparing the results obtained in Figure 5.13 against those of Figures 5.11 and 5.12, one can see that CCP produces the lowest degradation. In the results regarding the T4 and T5 topologies, CCP produces the smallest degradation, but does not achieve the best degradation overall. Whereas CCP is the best for a majority of circuit instances on the fully connected topology, with a maximum degradation of about 9 (10 × dmax(H)), one can see that PaToH also produces good results on topology T3, with a maximum degradation of about 9. In comparison, the next best tools, khMetis and hMetis, yield a degradation close to 12 (13 × dmax(H)). It is worth noting that the greatest degradation occurred on the ITC circuits, which are smaller in size. In fact, since there are few paths, the critical path is very often cut, resulting in significant critical path degradation.
Circuit instance
Figure 5.10: Results of degradation produced by hMetis, khMetis, PaToH, KaHyPar, TopoPart, DBFS,
DDFS and CCP when mapping onto architecture T2. Note T2 is a form of a path with 4 FPGAs, hence, T2 causes
more critical path degradation due to its topological structure. For this topology, hMetis, khMetis, KaHyPar
and DBFS seem to perform better than the others. PaToH performs better for 3 circuits, but its degradation
reaches a value of 5 for half of circuit benchmarks, compared to hMetis, khMetis, KaHyPar and DBFS, each
below a value of 5 for 60% of circuits. Note that TopoPart end with higher degradation while all the others
J. Rodriguez
5.5. Experimental results
30 dbfs
ddfs
ccp
20
137
results on multiple additional routing cost in small critical path, resulting in large degradation.
The results for DBFS and DDFS presented in Figure 5.13 evidence the same
behavior. It is important to note that, overall, DBFS performs better than DDFS.
For example, b10 has a degradation of 6 with DBFS, versus 11 with DDFS; b12
has a degradation of 5 with DBFS, versus 8 with DDFS. Conversely, the mnist
circuit has a degradation of 5 with DDFS, versus 7 with DBFS. In line with what
was presented in Section 5.2 on the differences between DBFS and DDFS, these
examples show that, in practice, there are circuits for which one of these two
strategies is more advantageous than the other.
τλ(H^Π) = ω(Π) / min_{algo ∈ ALGO} ω(Π_algo) ,    (5.5)
with ALGO being the set of algorithms to compare. We chose to present this
cost on a logarithmic scale because DBFS and DDFS typically produce partitions
whose connectivity-minus-one cost can be one order of magnitude higher than the best
cost, resulting in large relative costs. As a result, log(τλ(H^Π)) = 0 indicates a
relative cut cost equal to the best cut cost, i.e., τλ(H^Π) = 1.
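The relative cost of Equation (5.5) can be illustrated with a small, hypothetical example; the cost values below are illustrative only, not measured results:

```python
import math

def relative_cost(costs, algo):
    """Relative cost tau_lambda of Eq. (5.5): the cut cost omega(Pi) of
    `algo` divided by the best cost over all algorithms in ALGO."""
    best = min(costs.values())
    return costs[algo] / best

# Hypothetical connectivity-minus-one costs for one circuit instance.
costs = {"hmetis": 10.0, "dbfs": 100.0}
assert relative_cost(costs, "hmetis") == 1.0            # best tool: log(tau) = 0
assert math.log10(relative_cost(costs, "dbfs")) == 1.0  # one order of magnitude worse
```

On the logarithmic scale used in the plots, the best tool thus sits at 0 and a tool one order of magnitude worse sits at 1.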
In this set of algorithms, only TopoPart computes a partition while taking
into account the target topology. Algorithms that take the target topology into
account will not produce the same partitions for each topology, unlike the other
min-cut tools. For this reason, we decided to evaluate the connectivity-minus-one
cost of the algorithms on fully connected topologies. Additional results for this
metric on other target topologies can be found in Appendix A.2.
In Figure 5.14 and Figure 5.15, hMetis, KaHyPar, and khMetis produce partitions
with the best connectivity-minus-one cost, compared to TopoPart, DBFS, DDFS
and CCP. We note that PaToH yields good connectivity-minus-one costs for
half of the circuit instances. Tables of numerical results can be consulted in
Appendix A.2.
[Figure: execution-time boxplots; y-axis: logarithm of the execution time in ms (y = log(t)).]
Figure 5.14: Logarithm of connectivity-minus-one cost relative to the best results when mapping onto architecture T3 for
hMetis, khMetis, PaToH, KaHyPar, TopoPart, and the greedy path cost algorithms: DBFS, DDFS and CCP.
We notice a break between min-cut and greedy algorithms. However, TopoPart behaves like DBFS, DDFS, and
CCP, while PaToH produces good results for half of the circuits. DBFS produces the worst connectivity-minus-one
results.
Figure 5.15: Logarithm of connectivity-minus-one cost when mapping onto architecture T6 for hMetis, khMetis,
PaToH, KaHyPar, TopoPart, and the greedy path cost algorithms: DBFS, DDFS and CCP. As for target topology
T3, one can see a break between min-cut and greedy algorithms. Similarly as on T3, TopoPart behaves like DBFS,
DDFS, and CCP, and PaToH produces good results for half of the circuits. DBFS and DDFS produce the worst
connectivity-minus-one results.
5.6 Conclusion
In this chapter, we introduced two initial partitioning algorithms, DBFS and
DDFS, based on graph traversal, for the problem of partitioning with path cost
minimization. The aim of these two algorithms is to favor the grouping of critical
vertices in order to avoid cuts along critical paths. DBFS explores the red-black
hypergraph from DAH to DAH, which is interesting when the DAHs are highly
connected. However, for very sparse DAHs, DDFS tends to perform better than DBFS,
because DDFS explores the hypergraph in depth.
We also presented the CCP algorithm, an extension of the cone partitioning
algorithm, to tackle the partitioning of red-black hypergraphs with path
cost minimization. The structure of a cone provides path grouping properties that
are of interest for optimizing the fp function. In addition, this algorithm locally
groups critical components throughout the hypergraph, making CCP more adaptable
to various hypergraph topologies than the more specific DBFS and DDFS.
However, CCP does not produce a prescribed number of parts; to obtain the expected
number of parts, CCP must be coupled with a partitioning algorithm.
As integer programming algorithms can be interesting to provide an exact
initial partition, we introduced in the previous section our approach based on an
integer program. This integer program takes advantage of scheduling constraints
to place paths according to the target topology, as opposed to DBFS, DDFS and
CCP.
In order to optimize the partitions produced by DBFS, DDFS and CCP, we made
some suggestions in Section 5.4 for mapping the initial partition.
Experimental results show that the DBFS and DDFS algorithms are relevant
and complementary initial partitioning methods, depending on the circuit instances
and their underlying topologies. These algorithms seem to be a good approach for
the prototyping of circuits on a multi-FPGA platform. However, these methods
tend to degrade cut size. We presented results of DBFS, DDFS and CCP alongside
min-cut tools on all circuit benchmarks. These results show that routing costs on
the FPGA platform have an impact on our algorithms. In fact, for fully connected
topologies, CCP yields the best results, and DBFS and DDFS produce overall good
results, no worse than min-cut tools. These algorithms do a good job of capturing
critical paths and placing them in a single part whenever possible, unlike min-cut
algorithms that focus solely on cut size.
Results of applying an exact solver to our integer program, used as the initial
partitioning inside a multilevel scheme, show that our approach is better at minimizing
fp than a min-cut partitioning tool, even though it is oriented towards minimizing the
sum of the criticalities of the cut hyperarcs. However, when the hypergraphs are large,
the coarsening step over-approximates the paths, resulting in an initial partitioning
of lower quality.
Chapter 6
Refinement algorithms
6.1. Refinement algorithms
This chapter discusses the refinement algorithms that we studied in the context
of this thesis. These algorithms are part of the third step of the multilevel scheme:
uncoarsening and refinement. Refinement algorithms aim to improve an existing
partition. Given a hypergraph H and a partition Π, refinement algorithms aim to
improve Π by moving vertices of the frontier (or halo), i.e., vertices whose hyperedges
do not have all their vertices in the same part. In general, refinement algorithms
calculate a gain associated with a vertex move. The vertices of highest gain, that
is, those which improve the objective function most, are moved first. Some algorithms
also accept vertex moves with a negative gain, that is, moves which degrade the
current solution, in order to escape from local minima and improve the search over
the solution space [75, 114]. Other refinement algorithms exist, such as approaches
based on computing a minimum separator with a max-flow algorithm. P. Sanders
et al. [168] presented a refinement method based on a max-flow algorithm for
the graph partitioning problem. T. Heuer et al. [94] adapted this algorithm to
hypergraph partitioning. In our hypergraph partitioning context, computing a
separator of minimum cut does not necessarily improve the solution quality. We
are interested in refinement algorithms based on local search, such as KL [114] and
FM [75], because these algorithms have proven their effectiveness and their ability to
adapt to different objective functions. KL, introduced by B. W. Kernighan and
S. Lin, is explained in Section 6.1.1, and FM, introduced by C. M. Fiduccia and R.
M. Mattheyses, is described in Section 6.1.2. These two approaches inspired our
Delay K-way FM refinement algorithm, presented in Section 6.1.4. Some sections
of this chapter are based on our published work [160].
Figure 6.1: Example of partition status after the refinement process. Here, the
plain and dotted lines are the current and desired frontiers of the partition,
respectively.
In the KL algorithm, each vertex u is associated with an integer value gain1_KL(u),
which evaluates the impact on the cost function of moving the vertex
from its current part to the other part. The gain for moving vertex u from part
π0 to part π1 is defined by:

gain1_KL(u) ≝ Σ_{v ∈ Γ(u) ∩ π1} We(u, v) − Σ_{v ∈ Γ(u) ∩ π0} We(u, v) .    (6.1)
The first term evaluates the sum of the weights of the formerly cut edges that will
no longer be cut after the move, while the second term evaluates the cost of the
edges that will be cut by the move. Since the KL algorithm is applied to the graph
bisection problem, i.e., a balanced bipartition, moving a vertex from part π0 to
part π1 implies the move of another vertex from part π1 to π0 , so as to maintain
the balance of the partition. Therefore, one can compute a gain per pair of two
swapped vertices u, v, defined by:
def
gain2KL (u, v) = gain1KL (u) + gain1KL (v) − 2 × We (u, v) . (6.2)
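A minimal Python sketch of these two gain computations, assuming a weighted undirected graph stored as adjacency sets with a weight map keyed by vertex pairs (all names below are illustrative, not the thesis implementation):

```python
def gain1(u, adj, part, W):
    """KL single-vertex gain (Eq. 6.1): weight of edges to the other part
    minus weight of edges inside u's own part. adj[u] is the neighbor set
    of u; part[u] is u's current part; W maps frozenset edges to weights."""
    ext = sum(W[frozenset((u, v))] for v in adj[u] if part[v] != part[u])
    intern = sum(W[frozenset((u, v))] for v in adj[u] if part[v] == part[u])
    return ext - intern

def gain2(u, v, adj, part, W):
    """KL pair gain (Eq. 6.2) for swapping u and v across the bisection.
    The edge (u, v), if it exists, stays cut after the swap, hence -2*We."""
    w_uv = W.get(frozenset((u, v)), 0)  # 0 if u and v are not adjacent
    return gain1(u, adj, part, W) + gain1(v, adj, part, W) - 2 * w_uv

# Tiny example: {0, 1} in part 0, {2, 3} in part 1.
adj = {0: {1, 2}, 1: {0}, 2: {0, 3}, 3: {2}}
part = {0: 0, 1: 0, 2: 1, 3: 1}
W = {frozenset((0, 1)): 1, frozenset((0, 2)): 3, frozenset((2, 3)): 1}
assert gain1(0, adj, part, W) == 2   # uncuts weight 3, cuts weight 1
assert gain2(0, 2, adj, part, W) == -2  # the heavy edge (0, 2) remains cut
```

The negative pair gain for adjacent endpoints illustrates the −2 × We(u, v) correction: swapping both ends of a cut edge leaves that edge cut.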
34: gain* ← −1    ▷ Initial value for the best gain; −1 enforces choosing the first move with a positive gain
35: gain′ ← 0
36: for i ∈ |moves| do    ▷ Each move is simulated
37:     gain′ ← gain′ + gain(moves(i))
At the end of the loop, all swaps have been simulated, i.e., all vertices of π0 are
now in π1, and all vertices of π1 are in π0. Then, each swap is processed in the
order in which it was inserted, i.e., by descending order of gain. A gain can be
negative if the swap increases the size of the bisection. Each gain is added to a
running sum of gains, which measures the quality of the sequence of exchanges over
[0, i*], where i* is the index of the last swap for which the sum of gains is maximized.
The search continues through all swaps, so as to overcome local maxima. Finally,
the swaps in [0, i*] are applied, and the procedure is repeated as long as the gain
of a pass is positive. The worst-case complexity of the algorithm is in O(|V|^3)
time [114]. However, S. Dutt et al. [65] presented an improvement in execution
time, in O(|E| max(log(|V|), ∆)).
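The prefix-maximization step that selects i* can be sketched as follows; tolerating negative intermediate gains along the prefix is what lets the pass climb over a local maximum (a minimal sketch, not the thesis implementation):

```python
def best_prefix(gains):
    """Given the per-swap gains of one KL pass, in the order the swaps were
    simulated, return (i_star, best_sum): the index of the last swap to keep
    and the maximum cumulative gain. i_star = -1 means: apply no swap."""
    best_sum, best_i, running = 0, -1, 0
    for i, g in enumerate(gains):
        running += g            # negative gains are tolerated along the way
        if running > best_sum:  # keep the prefix with the best cumulative gain
            best_sum, best_i = running, i
    return best_i, best_sum

# Prefix sums are 3, 2, 4, -1: keeping the first three swaps is best.
assert best_prefix([3, -1, 2, -5]) == (2, 4)
# An all-negative pass keeps no swap, which terminates the outer loop.
assert best_prefix([-1, -2]) == (-1, 0)
```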
The gain function gainFM(v, π1), associated with moving v to part π1, is defined by:

gainFM(v, π1) ≝ Σ_{e ∈ E s.t. v ∈ e ∧ ∀u ≠ v ∈ e: u ∈ π1} We(e) − Σ_{e ∈ E s.t. v ∈ e ∧ ∀u ≠ v ∈ e: u ∈ π0} We(e) .    (6.3)
The gain of a move can be negative or positive. Let λ(v) = |{e | v ∈ e, e ∈ E}|
be the connectivity of vertex v, i.e., the number of hyperedges connected to this
vertex. In the electrical engineering literature, vertices are called “pins”, hyperedges
are called “nets”, and the connectivity value λ(v) is the number of nets containing
some pin. Let Λ_V = max{λ(v), v ∈ V} be the maximum degree in the hypergraph.
It is possible to bound the gain function for any vertex v by [−λ(v), +λ(v)], and,
in the general case, by [−Λ_V, +Λ_V]. Consider the following example, in which all
hyperedges connected to v have all their vertices in the same part as v. If v moves to
another part, then λ(v) new hyperedges will be cut. This is the worst case, with
a gain equal to −λ(v). In the opposite case, when v is in one part and all its
neighbors, i.e., the vertices contained in hyperedges incident to v, are in another
part π, moving v into part π will yield a gain of +λ(v). Vertices are chosen each
time by taking the vertex of best gain. If a partition is considered unbalanced,
moves that rebalance it are allowed, even if their gain is negative. Allowing moves
with negative gain may allow the algorithm to escape from local minima.
At the end of the pass, the balanced partition with the best cost is selected.
As with the KL algorithm, each vertex can be moved only once during each
refinement pass. In the FM algorithm for non-weighted hypergraphs, a bucket-list
(BL) data structure stores vertex gains. This data structure maintains two arrays
of size 2Λ_V, in which each cell i is linked to a doubly-linked list containing the
vertices of gain i. The first array stores the vertices that are susceptible to move
from π0 to π1, and the second array stores the vertices that are susceptible to move
from π1 to π0. If a vertex v is linked to gain cell x of BL(π0), its gain for moving
to π1 is x. The authors show in their work [75] that the run-time complexity of
the FM algorithm is in O(Λ_V) operations per pass using this data structure.
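A simplified sketch of one such bucket array, with deques standing in for the doubly-linked lists and a `max_gain` pointer standing in for the bookkeeping that locates the best non-empty cell (illustrative names; the original uses two such arrays, one per move direction):

```python
from collections import deque

class BucketList:
    """FM-style bucket list: 2*Lambda + 1 cells, cell i holding the vertices
    whose gain is i - Lambda. Gains are bounded by [-Lambda, +Lambda]."""
    def __init__(self, max_degree):
        self.off = max_degree                      # index offset for negative gains
        self.cells = [deque() for _ in range(2 * max_degree + 1)]
        self.max_gain = -max_degree - 1            # marker: structure is empty

    def insert(self, v, gain):
        self.cells[gain + self.off].append(v)
        self.max_gain = max(self.max_gain, gain)

    def pop_best(self):
        # Slide max_gain down to the highest non-empty cell, then pop from it.
        while self.max_gain >= -self.off and not self.cells[self.max_gain + self.off]:
            self.max_gain -= 1
        if self.max_gain < -self.off:
            return None
        return self.cells[self.max_gain + self.off].popleft()

bl = BucketList(3)
bl.insert("a", 2); bl.insert("b", -1); bl.insert("c", 2)
assert bl.pop_best() == "a"   # best-gain vertices come out first
assert bl.pop_best() == "c"
assert bl.pop_best() == "b"
```

Because gains only change by bounded amounts, the `max_gain` pointer moves monotonically between updates, which is what yields the amortized constant-time selection of the best move.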
of their lower complexity and ease of implementation. The k-way refinement al-
gorithm generally follows the same pattern as the 2-way algorithm described in
the previous section. Several works [6, 16, 112, 167] have proposed extensions
of the FM refinement algorithm to KFM, “K” standing for k-way partitioning.
Readers interested in the different implementations of variants of Fiduccia and
Mattheyses' algorithm can, e.g., refer to the work of Ü. Çatalyürek et al. [35].
gainDKFM(Π, Π′, H) ≝ d^Π_max(H) − d^{Π′}_max(H) .    (6.4)
When calculating the gain, the movement of a vertex from one part to another is
simulated, to calculate the effect of this movement on the value of the critical path.
The recalculation of the critical path has a complexity in O(|V|) time. Indeed,
if the vertex is red, the algorithm computes the critical path in all concerned
DAHs. In the worst case, the red vertex connects to all DAHs, which explains
this maximum complexity in O(|V|). When a vertex is moved from one part to
another, the critical path update can occur throughout the DAH. For example, in
Figure 6.2, the critical path of the partition Π is pa. The cost of the partition Π is
therefore d^Π_max(H) = d(pa). Moving vertex v to part π0 seems to be a good choice,
as it reduces the cost of the critical path. Π′ is the result of moving v to part
π0. Since pa is no longer routed through π1, the new path p′a is such that
d(p′a) < d(pb). Unless we compute the effect of partition Π′ on the whole red-black
hypergraph, d^{Π′}_max(H), one cannot obtain the value of the new critical path
locally, i.e., by a local calculation on the neighborhood of v or of pa.
The problem addressed in classical partitioning algorithms is the minimization
of the cut size fc or of the connectivity cost fλ. These functions are defined as a sum
of local hyperedge cut/connectivity costs: if a hyperedge is cut, it is
accounted for in the sum of cut hyperedges. Hence, thanks to the associativity
of the sum, if the cost of one hyperedge changes, the fc and fλ cost functions do
not require recomputing the cost of every hyperedge to be evaluated.
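This locality argument can be made concrete: when a vertex moves, only its incident hyperedges need re-evaluation, and the new global cut size is obtained by adding their deltas to the old one. A sketch under a simple set-based representation (hypothetical names, unweighted fc for brevity):

```python
def cut_cost(hyperedges, part):
    """Global cut size f_c: number of hyperedges spanning more than one part."""
    return sum(1 for e in hyperedges if len({part[v] for v in e}) > 1)

def cut_delta(e, part, v, new_part):
    """Change in f_c contributed by hyperedge e when v moves to new_part.
    Only the hyperedges incident to v can change status."""
    before = len({part[u] for u in e}) > 1
    after = len({(new_part if u == v else part[u]) for u in e}) > 1
    return int(after) - int(before)

hyperedges = [(0, 1), (1, 2)]
part = {0: 0, 1: 0, 2: 1}
old_cost = cut_cost(hyperedges, part)                             # (1, 2) is cut
delta = sum(cut_delta(e, part, 1, 1) for e in hyperedges if 1 in e)
part[1] = 1                                                       # apply the move
assert cut_cost(hyperedges, part) == old_cost + delta             # local update suffices
```

No such local update exists for d_max(H): moving one vertex can promote a path anywhere in the DAH to critical, which is why the DKFM gain of Equation (6.4) may require a global recomputation.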
Figure 6.2: Illustration of what happens when a vertex is moved from part π1
to π0. In this example, pb, a path in the hypergraph, may become the new critical
path.
ς(H^Π, u, v) ≝ ρ (d^Π_max − d^Π_max(u, v)) / D_{Π[u]Π[v]} ,    (6.5)
with u and v being the source and sink vertices of some hyperarc, and ρ a cut
tolerance, i.e., a parameter which tunes the degradation capacity ς.
Hence, if ρ is equal to zero, each degradation capacity is equal to zero; in this case,
the critical path is evaluated for each modification of the partition. When ρ > 0,
dmax(H) will only be updated if ς(H^Π, u, v) ≤ 0. The value of ρ drives the quality of
the gain computation during a DKFM pass.
The description of a refinement pass is provided as Algorithm 10. The first step
consists in randomly selecting a part from which a vertex can be moved without
overflowing the part size. Consequently, the algorithm can rebalance unbalanced
partitions at the start of processing, by encouraging the movement of vertices from
overloaded to underloaded parts, even if the resulting gain is negative.
At line 2, part k′ is selected and the vertex to move is retrieved from the
list of moves associated with part k′. The vertex with the best gain is selected
using an array that stores the best gain for each part. The following lines, from
line 3 to line 9, move the vertex v from k to k′. The new critical path associated
with the new partition Π′ is calculated at line 9. If the new part k′ exceeds the
maximum capacity Mk′, then part k′ is removed from the candidate parts, and
no vertex is added to part k′ until the capacity constraint is met again. Finally,
we update the current best solution if the new critical path length is lower than
that of the previous best solution. Each vertex that is moved is locked, to avoid
multiple moves or back-and-forth movements.
The for loop at line 21 performs the gain update processing on the neighborhood
of v. We only calculate the gain of moving the neighbors v′ of v to the new part
of v, k′. If the neighbors of v are no longer connected to the old part of v, part
k, then k is removed from the set of neighboring parts B(v′). Moves of neighbors
v′ that are no longer connected to part k are also removed. Finally, we calculate
the gain of the neighboring vertices for a move to k′. If a vertex v′′ was not connected
to part k′, then part k′ is added to the set B(v′′) and the gain of a move of v′′ to
k′ is calculated. We need not consider the other parts, as their gains have already
been calculated. In the case where the critical path value has changed, i.e., in the
case of a deterioration or an improvement due to moving v into part k′, the
critical path value has to be updated according to the degradation capacity ς. It
is assumed that the gains will still be locally accurate and will reduce the local
critical path degradation.
In addition, the DKFM algorithm is called at each level during the uncoarsening
stage. As a result, the gains are recomputed for each level. If a critical path
changes, it will be processed at the next level. The complete algorithm in pseudo-
code is shown as Algorithm 11. The data structure used in the DKFM algorithm
33: end if
34: if Π(v′) ≠ Π(v) then
35:     k ← Π(v′)    ▷ Here, k is the new part of v′
36:     Π′ ← Π \ {πk, πk′}
37:     πk ← πk \ {v′}
38:     πk′ ← πk′ ∪ {v′}
39:     Π′ ← Π′ ∪ πk′ , Π′ ← Π′ ∪ πk
40:     gain ← pmax − d^{Π′}_max(H)
41:     ςv′ ← ρ(pmax) − r(v′)/Dkk′
42:     if |B(v′)| = 0 ∧ v′ ∉ πk′ then    ▷ v′ goes to the halo
43:         B(v′).add(k′)    ▷ k′ goes to the neighbor parts of v′
44:         is_locked[v′].add(v′)
45:     else
46:         remove(vertex_pointer, v′, k)    ▷ Delete the old gain of moving v′ to k′
47:     end if
48:     moves[k′][gain].insert(v′, ςv′)    ▷ Insert the new gain of moving v′ to k′
49: end if
50: end if
51: end for
52: return Π
[Figure 6.3: the DKFM gain data structure for a part πk: per-part gain arrays indexed from −gain to +gain, with the vertex_pointer and best_gain auxiliary arrays.]
is similar to that used in the FM algorithm. The data structure consists of a gain
array, where each cell represents the value of a gain, bounded by the minimum
and maximum gains. Each cell in the array contains a doubly-linked list of the
vertices to be moved to the said part, with a gain corresponding to the cell index
in the array. Such an array is constructed for each part πk. In addition to this
structure, the array vertex_pointer associates the vertices with their pointers in
the doubly-linked lists, to find them more efficiently. The best_gain array stores
a pointer to the best gain for each part, giving access to the best moves in O(1)
time. An example of this data structure for some part πk is shown in Figure 6.3.
Lemma 6.1.1. One pass of the Delay K-way Fiduccia-Mattheyses local search
Algorithm 10 runs in O(|V|^2) time, with |V| being the number of vertices.
Proof. The algorithm chooses a part in which to make a move. A vertex of best
gain is selected, and a gain update is calculated for its neighborhood. The algorithm
performs assignments and reads from lines 1 to 9 in O(1) time. At line 10,
the computation of the new value of the critical path applies, in the worst case, to
the entire red-black hypergraph, which results in a computation time in O(|V|).
The for loop at line 22 iterates over the entire neighborhood of the moved vertex v;
in the worst case, |Γ(v)| = |V| − 1. In this loop, we perform a second traversal
of the neighbors of v and a critical path calculation at line 43, evaluating the new
gain of each vertex. Since the calculation of the critical path and the traversal of
the neighbors are, in the worst case, in O(|V|) time, and |Γ(v)| ≤ ∆, we get
a complexity in O(∆|V|) time. The operations of reading, inserting, and deleting
in the moves structure are performed in constant time using a doubly-linked
list.
The while loop at line 13 makes a number of calls to the DKFM pass described as
Algorithm 10. The complexity of this pass is in O(|V|^2) time, as seen in
Lemma 6.1.1. Since the while loop is repeated N times, the complexity of the
Delay K-way Fiduccia-Mattheyses local search Algorithm 11 is in
O(|V|^2 + |V|^2 + N × |V|^2) time. When N is set in O(log2(|V|)), the complexity
of the while loop at line 13 is in O(|V|^2 log2(|V|)) time. Then, for N in
O(log2(|V|)), the total complexity of the Delay K-way Fiduccia-Mattheyses local
search Algorithm 11 is in O(|V|^2 + |V|^2 + |V|^2 log2(|V|)), that is, in
O(|V|^2 log2(|V|)) time.
The dedicated data structure, first introduced by C. M. Fiduccia and R. M.
Mattheyses [75] and extended for the DKFM algorithm, can have a high memory
footprint for large red-black hypergraphs. As presented in the thesis of S. Schlag
and in previous work [146, 173], priority queues can be used to store the best moves
for each part. Consequently, we also use priority queues to improve the complexity
of our DKFM algorithm. Let us study the complexity of DKFM with priority queue
data structures instead of the classic FM data structure.
Proof. The algorithm applies several refinement steps to an initial partition given
as a parameter. First, the algorithm computes the halo λΠ, i.e., the set of vertices
located at the frontier of the partition. In the worst case, when each vertex is
connected to all the other vertices, this calculation is performed in O(|V|^2) time.
Then, the gain for each vertex in the halo is computed in O(k × |V|) time, because
the computation of the new critical path, in O(|V|) time, has to be performed for each
connected part. In the worst case, the halo contains all vertices and each vertex is
connected to each part. Then, as a priority queue is used instead of a doubly-linked
list, the insertion of a vertex is performed in O(log2(|V|)) time; a heap data structure
is used to represent the priority queues. Hence, the worst-case time complexity for the
calculation of the vertex gains is in O(k × |V| × (|V| + log2(|V|))) = O(k × |V|^2). At this
step, the complexity is similar to that of the previous version, because the computation
of the critical path has a higher complexity than the gain insertion. Each best move
is present at the head of its priority queue.
The second part of the algorithm consists of N vertex movements across the
frontier of the partition. For each movement, a vertex is dequeued from a selected
candidate part. The selection of a candidate part is made in O(k) time and
the dequeue operation in O(log2(|V|)) time. Then, after each move, the gains of
the neighbor vertices have to be updated. In the worst case, the number of neighbors
is |V|, that is, the worst-case time complexity for updating the gains is in O(|V| × (|V| +
log2(|V|))). Since the while loop is repeated N times, the complexity of this
second part is in O(N × |V| × (|V| + log2(|V|))) time.
In the previous Lemma 6.1.3, we showed that the use of priority queues instead
of doubly-linked lists does not degrade the complexity of DKFM in the worst case.
This is specific to the problem of path cost minimization, because the gain computation
forces the evaluation of the critical path for each partition. As the calculation
of the critical path is performed with a higher complexity, the management
of the priority queue is absorbed in the worst-case complexity analysis.
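A sketch of such a per-part move queue, using a binary heap with lazy invalidation of stale entries (a common implementation pattern for FM-style refiners; the names are illustrative and not necessarily those used in DKFM):

```python
import heapq

class GainQueue:
    """Per-part move queue backed by a binary heap: push and pop in
    O(log |V|), versus the O(1) bucket list. Since the critical-path
    recomputation dominates the gain update anyway, the extra log factor
    is absorbed in the worst-case analysis. Outdated gains are handled by
    lazy deletion: invalidated vertices are skipped when they surface."""
    def __init__(self):
        self.heap = []          # entries (-gain, vertex): max-gain first
        self.stale = set()

    def push(self, v, gain):
        heapq.heappush(self.heap, (-gain, v))

    def invalidate(self, v):
        self.stale.add(v)       # mark instead of deleting in place

    def pop_best(self):
        while self.heap:
            neg_gain, v = heapq.heappop(self.heap)
            if v not in self.stale:
                return v, -neg_gain
        return None

q = GainQueue()
q.push("a", 3); q.push("b", 5)
q.invalidate("b")               # e.g., b was moved or its gain changed
assert q.pop_best() == ("a", 3) # stale entry for b is skipped
```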
6.2.1 Methodology
In Chapter 2, we presented existing approaches based on pre- and post-processes
using min-cut solvers as a main algorithm for partitioning. To the best of our
knowledge, there is no publicly available tool to tackle our problem; hence, the
scientific community has developed such procedures in combination with existing
min-cut tools to handle path cost. Following these approaches, we investigated
the efficiency of a refinement algorithm dedicated for our problem, combined with
existing min-cut tools.
Chronologically, the DKFM algorithm is the first we developed in the
course of this thesis. For this reason, we created a benchmark in which
DKFM is a post-processing procedure for a min-cut tool. However, the pre- and
post-processes presented in the state of the art are not publicly available, which
makes it difficult to compare against them. Hence, the aim of this benchmark is
to measure the relevance of using DKFM as a post-processing method.
To test the DKFM refinement algorithm, we first ran hMetis, PaToH, khMetis,
and kKaHyPar on the weighted hypergraph instances. We use the r∗ weighting
¹ https://kahypar.org/
Figure 6.4: Results of improvement produced by DKFM applied to partitions computed by hMetis, khMetis,
PaToH, KaHyPar, and TopoPart onto T1. For this topology, KaHyPar+DKFM seems to be better than
the others. However, the khMetis+DKFM curve is below that of KaHyPar+DKFM for some instances. Note that DKFM
has difficulty improving the partitions produced by TopoPart and PaToH.
Figure 6.5: Results of improvement produced by DKFM applied to partitions computed by hMetis, khMetis,
PaToH, KaHyPar, and TopoPart onto T2. For this topology, KaHyPar+DKFM seems to be better than the
others, except at index 25, where khMetis+DKFM produced less degradation. DKFM has difficulty reducing
the cost of the partitions produced by TopoPart. Note that hMetis and hMetis+DKFM have better results for 75%
of the circuits compared to PaToH and PaToH+DKFM, but hMetis and hMetis+DKFM produced a higher maximum
degradation than PaToH and PaToH+DKFM.
Figure 6.6: Results of improvement produced by DKFM applied to partitions computed
by hMetis, khMetis, PaToH, KaHyPar, and TopoPart onto the fully
connected T3 composed of 4 parts. Each point is the degradation of a circuit for the
corresponding algorithm. Each boxplot presents the logarithm of the execution times
in milliseconds for all circuits. The total DKFM execution time equals the sum of
the execution time of the min-cut algorithm plus the DKFM execution time. For
this topology, KaHyPar+DKFM and hMetis+DKFM seem to be better than the
others for critical path degradation. However, we point out that PaToH produced
a lower maximum degradation than the others. khMetis+DKFM is the third-best
algorithm here. Hence, DKFM shows limited improvement on PaToH partitions.
Note that DKFM's execution time is longer, but remember that the time limit for
DKFM is set at 400 seconds, which is still reasonable for the circuit sizes processed
and the results delivered.
Figure 6.9: Results of improvement produced by DKFM applied to partitions
computed by hMetis, khMetis, PaToH, KaHyPar, and TopoPart onto the fully
connected T6 composed of 8 parts. Each point is the degradation of a circuit for
the corresponding algorithm. Each boxplot presents the logarithm of the execution times
in milliseconds for all circuits. The total DKFM execution time equals the sum of
the execution time of the min-cut algorithm plus the DKFM execution time. For
this topology, KaHyPar+DKFM seems to be better than the others for critical
path degradation. hMetis+DKFM is the second-best algorithm here. Hence, we
evidence limitations of the DKFM improvement for PaToH.
Figure 6.10: Results of improvement produced by DKFMFAST applied to partitions computed by hMetis,
khMetis, PaToH, KaHyPar, and TopoPart onto T1. For this topology, KaHyPar+DKFMFAST seems to
be better than the others for 27 circuits. However, the results of khMetis+DKFMFAST are below those of
KaHyPar+DKFMFAST for two instances, that is, khMetis+DKFMFAST produced a lower degradation
than KaHyPar+DKFMFAST. Note that DKFMFAST has difficulty improving these partitions.
Figure 6.11: Results of improvement produced by DKFMFAST applied to partitions computed by hMetis,
khMetis, PaToH, KaHyPar, and TopoPart onto T2. For this topology, KaHyPar+DKFMFAST seems
to be better than the others. However, the results of khMetis+DKFMFAST are below those of KaHyPar+DKFMFAST
for two instances, that is, khMetis+DKFMFAST produced a lower degradation than KaHyPar+DKFMFAST.
Note that DKFMFAST has difficulty improving these partitions, as for T1.
Figure 6.12: Results of improvement produced by DKFMFAST applied to partitions
computed by hMetis, khMetis, PaToH, KaHyPar, and TopoPart
onto the fully connected T3 composed of 4 parts. Each point is the degradation of a
circuit for the corresponding algorithm. Each boxplot presents the logarithm of the
execution times in milliseconds for all circuits. The total DKFMFAST execution
time equals the sum of the execution time of the min-cut algorithm plus
the DKFMFAST execution time. For this topology, the khMetis+DKFMFAST
results on critical path degradation seem to be better than the others. We
point out that PaToH produced the lowest possible critical path degradation among
the others. hMetis+DKFMFAST is the second-best algorithm.
Hence, contrary to the previous figures, DKFMFAST shows improvement limitations
on hMetis partitions. Note that the total execution times of PaToH+DKFMFAST
and TopoPart+DKFMFAST increase significantly, because the
PaToH and TopoPart execution times are fast compared to the DKFMFAST
execution time. For the other tools, DKFMFAST only slightly increases the total
execution time of the procedure.
10
J. Rodriguez
6.2. Experimental results
[Plot legend: hmetis, hmetis+dkfmfast, khmetis, khmetis+dkfmfast, patoh, patoh+dkfmfast, kahypar, kahypar+dkfmfast, topopart.]
[Figure 6.15 plot: each boxplot shows the logarithm of the execution time in ms (y = log(t)); the x-axis lists the algorithms hmetis, hmetis+dkfmfast, khmetis, khmetis+dkfmfast, patoh, patoh+dkfmfast, kahypar, kahypar+dkfmfast, topopart, topopart+dkfmfast.]
Figure 6.15: Results of the improvement produced by DKFMFAST applied to parti-
tions computed by hMetis, khMetis, PaToH, KaHyPar, and TopoPart on
the fully connected topology T6, composed of 8 parts. Each point is the degradation of a
circuit for the corresponding algorithm. Each boxplot presents the logarithm of the execution
times in milliseconds over all circuits. The total DKFMFAST execution time equals
the execution time of the min-cut algorithm plus the DKFMFAST
execution time. For this topology, KaHyPar+DKFMFAST seems to produce
better critical path degradation results than the others, and khMetis+DKFMFAST
produced the second-best results. Hence, DKFMFAST yields improvements
except for hMetis, PaToH, and TopoPart. The analysis of execution times is
similar to that of the previous figures.
[Plot legend: hmetis, hmetis+dkfm, khmetis, khmetis+dkfm, patoh, patoh+dkfm, kahypar, kahypar+dkfm, topopart, topopart+dkfm.]
6.3 Conclusion
In this chapter, we presented an extension of the FM refinement algorithm to the
problem of partitioning red-black hypergraphs, aiming at minimizing critical path
degradation. This algorithm, called DKFM, considers routing costs to optimize
path allocation within the partition. The calculation of the critical path when
updating the gains is expensive, because all vertices must be processed. Indeed, as
shown in Figure 6.2, a move can change the critical path, which can arise anywhere
in the DAH. This aspect is an essential difference between the cut minimization
problem and the path minimization problem. Indeed, when a vertex is moved
across parts, cut costs change only for connected hyperarcs. In contrast, when
minimizing fp , moving a vertex locally has a global effect on the whole hypergraph.
This property makes this problem more difficult than min-cut for FM-type local
improvement algorithms.
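The locality contrast described above can be sketched in a few lines of Python (an illustrative model, not the thesis implementation): the cut-cost delta of a move inspects only the nets incident to the moved vertex, whereas the path cost must be re-evaluated over the whole DAG, here with a hypothetical uniform inter-part delay D.

```python
def cut_delta(nets_of, part, v, dest):
    """Cut-cost change if v moves to part `dest`: only the nets incident
    to v need to be inspected (local effect)."""
    src = part[v]
    delta = 0
    for net in nets_of[v]:
        others = {part[u] for u in net if u != v}
        before = len(others | {src}) > 1    # net cut before the move?
        after = len(others | {dest}) > 1    # net cut after the move?
        delta += int(after) - int(before)
    return delta

def critical_path(succ, topo, part, delay, D):
    """Longest path in the DAG when every arc crossing two parts pays a
    hypothetical inter-part delay D: every vertex must be re-visited
    (global effect)."""
    dist = dict(delay)                      # dist[v] = longest path ending at v
    best = 0
    for v in topo:                          # topo = topological order
        for w in succ[v]:
            extra = D if part[v] != part[w] else 0
            dist[w] = max(dist[w], dist[v] + delay[w] + extra)
        best = max(best, dist[v])
    return best
```

A single accepted move thus costs O(incident nets) for the cut objective but O(|V| + |A|) for the path objective, which is the asymmetry exploited in the next paragraph.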
In VLSI design, execution time is an important issue for tools. As fp mini-
mization involves updating the critical path after each move, DKFM is inevitably
slow. Therefore, we presented DKFMFAST, a faster version of our DKFM algo-
rithm that limits the evaluation of the critical path. DKFMFAST computes only a
subset of gains in addition to the cut capacity ς, which reduces the number of
evaluations of the critical path during the moves.
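One way to limit the number of exact evaluations, in the spirit of DKFMFAST but not its exact strategy, is to order candidate moves by a cheap cut-based gain and pay for the expensive path evaluation only on the selected candidate, under a move budget ρ (all function names here are illustrative placeholders):

```python
import heapq

def refine_lazy(vertices, cheap_gain, exact_path_cost, move, undo, rho=0.2):
    """FM-style pass: rank moves by a cheap (cut-based) gain; evaluate the
    expensive path cost only once per selected candidate."""
    heap = [(-cheap_gain(v), v) for v in vertices]
    heapq.heapify(heap)
    budget = max(1, int(rho * len(vertices)))   # at most rho * |V| accepted moves
    best = exact_path_cost()
    evaluations = 1
    while heap and budget > 0:
        _, v = heapq.heappop(heap)
        move(v)
        cost = exact_path_cost()                # expensive: once per candidate
        evaluations += 1
        if cost <= best:
            best = cost
            budget -= 1
        else:
            undo(v)                             # reject degrading moves
    return best, evaluations
```

The cheap gain acts as a filter: path cost is recomputed once per popped candidate instead of once per tentative (move, destination) pair, which is where the speed-up over a naive DKFM pass comes from.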
In Section 6.2, we presented experimental results on some circuits and target
topologies, both introduced in Chapter 3. The results show that, in most cases,
DKFM succeeds in refining the partitions created by the min-cut tools. In this
setting, DKFM performs at most 20% of the moves (ρ = 20%), and
the run-time limit is set to 400 s.
We observed that DKFM had some trouble improving some of the
partitions derived from PaToH. Among all min-cut algorithms, the combination
of KaHyPar with DKFM and DKFMFAST generally gives the best refinement
outcome.
Conclusion and Perspectives
Summary of the dissertation
from the initial partitioning step is a main part of the perspectives of this thesis.
In this thesis, we have proposed algorithms for each phase of the multilevel
scheme. The combination of these algorithms creates a multilevel scheme for par-
titioning red-black hypergraphs. Since the multilevel scheme has proven its effi-
ciency in partitioning hypergraphs, future algorithms aimed at minimizing path
cost in partitioning red-black hypergraphs should take advantage of the multilevel
scheme, combined with the proposed algorithms.
Appendix A
Experimental results
A.1 Numerical results of clustering algorithms
Table A.1: Path cost results of clustering algorithms: HEM and BSC, for circuits in ITC, Chipyard and Titan with maximum cluster size set to 2.
Table A.2: Path cost results of clustering algorithms: HEM and BSC, for circuits in ITC, Chipyard and Titan with maximum cluster size set to 4.
Table A.3: Path cost results of clustering algorithms: HEM and BSC, for circuits in ITC, Chipyard and Titan with maximum cluster size set to 8.
Table A.4: Path cost results of clustering algorithms: HEM and BSC, for circuits in ITC, Chipyard and Titan with maximum cluster size set to 16.
Table A.5: Path cost results of clustering algorithms: HEM and BSC, for circuits in ITC, Chipyard and Titan with maximum cluster size set to 32.
Table A.6: Path cost results of clustering algorithms: HEM and BSC, for circuits in ITC, Chipyard and Titan with maximum cluster size set to 64.
Table A.7: Path cost results of clustering algorithms: HEM and BSC, for circuits in ITC, Chipyard and Titan with maximum cluster size set to 128.
Table A.8: Path cost results of clustering algorithms: HEM and BSC, for circuits in ITC, Chipyard and Titan with maximum cluster size set to 256.
Table A.9: Path cost results of clustering algorithms: HEM and BSC, for circuits in ITC, Chipyard and Titan with maximum cluster size set to 512.
Table A.10: Path cost results of clustering algorithms: HEM and BSC, for circuits in ITC, Chipyard and Titan with maximum cluster size set to 1024.
Table A.11: Path cost results of clustering algorithms: HEM and BSC, for circuits in ITC, Chipyard and Titan with maximum cluster size set to 2048.
Table A.12: Path cost results of clustering algorithms: HEM and BSC, for circuits in ITC, Chipyard and Titan with maximum cluster size set to 4096.
A.2 Numerical results of initial partitioning algorithms
Table A.14: Results for critical path: d^Π_max(H), on target T2 for circuits in ITC set.
b04 4.84 4.07 5.76 4.46 9.61 3.63 8.06 8.73
b05 1.22 1.27 1.22 1.50 5.62 1.90 2.89 6.08
b06 21.46 15.18 15.18 19.22 27.20 18.23 24.15 18.29
b07 3.00 3.83 3.04 3.75 12.33 3.29 6.02 4.21
Table A.15: Results for critical path: d^Π_max(H), on target T3 for circuits in ITC set.
Table A.16: Results for critical path: d^Π_max(H), on target T4 for circuits in ITC set.
b04 7.84 8.98 7.39 7.28 9.65 6.50 11.07 10.47
b05 3.67 2.88 3.69 4.23 5.58 3.45 7.46 5.89
b06 18.41 24.33 21.46 30.48 42.80 27.55 33.48 27.26
b07 7.46 7.10 7.87 7.66 11.33 6.85 12.43 5.29
Table A.17: Results for critical path: d^Π_max(H), on target T5 for circuits in ITC set.
Table A.18: Results for critical path: d^Π_max(H), on target T6 for circuits in ITC set.
b04 4.23 4.23 3.35 3.58 8.98 4.10 7.67 4.06
b05 1.81 1.80 1.83 1.82 5.77 1.79 5.08 3.09
b06 12.31 9.26 9.26 13.76 12.13 15.36 12.13 8.97
b07 4.19 4.89 4.06 4.06 7.65 3.58 4.39 1.67
Table A.19: Results for critical path: d^Π_max(H), on target T1 for circuits in Chipyard and Titan sets.
Table A.20: Results for critical path: d^Π_max(H), on target T2 for circuits in Chipyard and Titan sets.
Table A.21: Results for critical path: d^Π_max(H), on target T3 for circuits in Chipyard and Titan sets.
Table A.22: Results for critical path: d^Π_max(H), on target T4 for circuits in Chipyard and Titan sets.
Table A.23: Results for critical path: d^Π_max(H), on target T5 for circuits in Chipyard and Titan sets.
Table A.24: Results for critical path: d^Π_max(H), on target T6 for circuits in Chipyard and Titan sets.
Table A.25: Results for connectivity cost: f_λ(H^Π), on target T3 for circuits in ITC set.
b04 43169 43144 50322 44097 120418 220814 213636 161626
b05 33429 37964 39377 41717 206369 189284 233129 166778
b06 18915 23745 26575 18244 25965 39980 37675 31050
b07 40739 50279 52293 45156 121845 159148 211547 142385
Table A.26: Results for connectivity cost: f_λ(H^Π), on target T6 for circuits in ITC set.
Table A.27: Results for connectivity cost: f_λ(H^Π), on target T3 for circuits in Chipyard and Titan sets.
Table A.28: Results for connectivity cost: f_λ(H^Π), on target T6 for circuits in Chipyard and Titan sets.
Table A.30: Results for balance cost: β(H^Π), on target T6 for circuits in ITC set.
b04 2.86 1.27 1.00 1.50 1.67 1.67 1.53 1.67
b05 3.91 1.00 1.00 1.51 1.68 1.68 1.58 1.68
b06 4.45 1.00 1.00 1.00 1.00 1.00 1.00 1.00
b07 4.60 1.23 1.00 1.45 1.68 1.68 1.45 1.68
Table A.31: Results for balance cost: β(H^Π), on target T3 for circuits in Chipyard and Titan sets.
Table A.32: Results for balance cost: β(H^Π), on target T6 for circuits in Chipyard and Titan sets.
A.3 Numerical results of refinement algorithms: DKFM and DKFMFAST
Table A.33: Results for critical path: d^Π_max(H), on target T1 for circuits in ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 7.87 7.87 7.87 7.27 10.31 10.31 8.18 7.72 13.05 12.90
b02 12.31 12.31 12.31 12.13 9.09 9.09 15.18 12.13 12.31 12.13
b03 6.06 4.63 4.63 7.87 7.40 6.06 7.61 5.97 13.01 13.01
b04 4.03 3.89 3.35 3.03 3.35 2.75 3.36 2.22 11.42 11.42
b05 0.93 0.93 0.64 0.96 1.18 1.18 1.15 0.74 9.07 8.76
b06 11.96 9.26 12.31 12.13 12.31 12.31 14.98 12.31 15.36 15.18
b07 2.94 2.94 3.61 2.78 3.04 3.04 3.17 2.55 7.07 7.07
Table A.34: Results for critical path: d^Π_max(H), on target T2 for circuits in ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 10.31 10.31 13.05 10.31 12.75 12.60 10.59 10.01 20.37 17.93
b02 15.18 15.01 18.05 12.31 18.41 18.41 17.01 15.18 24.33 21.46
b03 7.87 7.78 7.87 9.30 7.68 7.68 9.56 8.15 12.72 12.72
b04 4.84 4.84 4.07 3.99 5.76 5.62 4.46 2.54 9.61 9.05
b05 1.22 1.22 1.27 0.96 1.22 1.22 1.50 1.17 5.62 5.62
b06 21.46 21.10 15.18 12.13 15.18 15.18 19.22 15.18 27.20 21.46
b07 3.00 3.00 3.83 2.88 3.04 3.04 3.75 2.74 12.33 11.05
b08 5.22 5.22 7.17 7.17 6.69 5.83 6.33 3.70 15.51 12.46
b09 17.93 14.35 17.93 7.21 10.78 10.78 15.07 10.78 21.50 19.71
b10 8.23 8.07 7.91 7.91 10.32 8.23 8.49 7.83 16.01 15.85
b11 4.74 3.92 4.89 4.36 6.99 5.82 4.91 3.36 7.99 7.08
b12 4.16 4.16 3.24 3.76 3.61 3.24 4.94 4.11 11.44 11.18
b13 3.38 1.81 1.81 2.10 2.89 2.89 2.17 1.95 8.38 8.38
b14 2.12 2.12 3.23 2.84 3.94 3.40 2.80 1.98 10.13 10.13
b17 0.68 0.67 0.81 1.65 0.81 0.81 0.67 0.40 10.32 7.88
b18 0.20 0.21 0.21 0.21 0.31 0.31 0.29 0.21 6.44 6.44
b19 0.30 0.38 0.38 0.38 0.31 0.31 0.34 0.28 7.04 6.84
b20 4.01 1.52 2.17 2.17 3.26 3.20 2.29 1.03 11.33 11.31
b21 2.99 0.98 2.00 2.00 2.21 2.20 2.59 1.73 9.59 9.36
b22 1.63 1.79 3.80 3.80 0.88 0.88 1.78 0.47 8.99 8.82
Table A.35: Results for critical path: d^Π_max(H), on target T3 for circuits in ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 5.28 5.28 7.87 7.27 7.72 7.72 7.42 5.28 10.31 10.31
b02 9.26 9.26 9.26 9.26 6.21 6.21 12.13 9.26 12.31 9.26
b03 4.54 4.54 4.63 6.25 4.45 4.45 6.97 4.92 11.39 9.77
b04 3.03 3.03 2.15 2.79 3.03 2.58 2.69 1.98 4.23 4.23
b05 0.93 0.93 0.64 0.64 1.18 1.18 1.06 0.74 5.91 5.91
b06 9.26 9.09 6.21 6.21 6.21 6.21 11.56 9.09 12.31 9.26
b07 2.40 2.40 2.52 2.21 2.74 2.74 2.44 1.98 5.85 5.40
Table A.36: Results for critical path: d^Π_max(H), on target T4 for circuits in ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 15.49 15.19 15.49 15.64 17.93 17.78 20.86 15.64 30.74 30.74
b02 18.41 18.41 21.28 24.51 24.51 21.46 31.09 21.28 21.28 21.28
b03 12.54 11.20 21.10 9.68 21.10 14.06 17.81 8.15 20.91 20.91
b04 7.84 7.24 8.98 7.14 7.39 7.39 7.28 5.83 9.65 9.58
b05 3.67 3.67 2.88 3.48 3.69 3.41 4.23 2.80 5.58 5.58
b06 18.41 18.41 24.33 27.38 21.46 21.46 30.48 24.15 42.80 42.80
b07 7.46 6.37 7.10 8.16 7.87 7.72 7.66 6.01 11.33 10.37
b08 8.08 8.08 15.45 16.54 15.45 15.45 13.85 6.25 14.47 14.47
b09 23.28 21.50 14.35 17.93 23.28 22.87 20.07 10.78 25.07 24.65
b10 21.69 21.69 16.16 13.52 13.68 13.68 19.49 13.68 23.05 21.69
b11 6.23 6.23 4.89 6.29 4.42 4.42 7.00 5.03 7.84 7.40
b12 9.38 9.17 6.80 8.40 15.42 15.42 8.22 6.44 14.49 11.86
b13 1.27 1.27 2.84 2.64 6.87 5.98 7.26 4.51 7.89 6.23
b14 5.53 5.48 5.76 4.76 5.84 5.30 4.90 3.78 11.57 10.53
b17 2.70 2.70 3.81 2.43 3.85 3.85 2.47 1.50 7.75 7.38
b18 0.41 0.35 0.71 0.71 0.93 0.93 0.43 0.25 4.93 4.93
b19 0.75 0.79 0.58 0.58 0.38 0.38 0.66 0.51 5.61 5.61
b20 2.88 2.73 6.05 6.05 5.11 5.05 4.72 3.33 9.80 9.34
b21 6.10 5.02 4.50 4.50 4.71 4.69 4.33 3.27 8.86 8.61
b22 4.02 2.88 4.73 4.63 3.34 3.34 3.73 2.69 7.84 7.84
Table A.37: Results for critical path: d^Π_max(H), on target T5 for circuits in ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 20.52 15.64 23.11 17.93 20.52 20.52 25.61 20.67 38.81 38.81
b02 18.41 18.23 30.60 24.33 27.55 27.55 32.68 21.28 30.43 24.51
b03 15.77 13.01 22.72 11.20 22.72 22.72 16.57 11.29 32.24 30.62
b04 12.06 9.76 9.61 8.94 9.30 8.70 9.13 7.32 9.44 9.30
b05 4.34 4.08 3.38 4.74 3.69 3.69 4.90 3.43 8.42 7.86
b06 36.70 30.43 30.43 27.38 21.46 21.46 35.34 21.46 21.28 18.41
b07 6.94 6.91 10.88 10.34 8.93 8.26 9.25 7.36 14.99 8.48
Table A.38: Results for critical path: d^Π_max(H), on target T6 for circuits in ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 7.87 7.72 7.87 7.72 7.72 7.72 9.93 7.87 15.64 13.05
b02 9.26 9.26 12.31 12.31 9.26 9.26 13.17 9.26 15.36 15.36
b03 4.92 4.82 8.15 8.15 8.15 6.53 8.10 6.44 11.29 11.29
b04 4.23 3.63 4.23 3.89 3.35 3.35 3.58 2.72 8.98 8.84
b05 1.81 1.59 1.80 1.80 1.83 1.83 1.82 1.22 5.77 5.49
b06 12.31 9.26 9.26 9.26 9.26 9.26 13.76 9.26 12.13 12.13
b07 4.19 3.64 4.89 4.89 4.06 4.03 4.06 3.29 7.65 6.56
b08 6.25 6.25 8.26 8.26 6.13 6.13 8.12 6.13 10.39 8.08
b09 10.78 10.78 9.00 9.00 10.78 10.78 10.78 7.21 14.35 12.57
b10 6.86 6.86 6.78 6.78 6.86 6.86 8.44 6.86 9.51 9.51
b11 1.93 1.93 2.34 2.34 2.34 2.07 3.01 2.43 4.47 4.47
b12 4.99 4.84 4.11 4.11 4.94 4.94 4.25 3.39 7.42 6.59
b13 1.27 1.27 1.70 1.70 2.05 1.95 2.93 1.95 7.35 5.29
b14 2.00 1.99 2.37 2.37 3.14 3.13 2.54 1.84 6.54 6.30
b17 0.86 0.84 1.17 1.12 1.45 1.45 0.98 0.86 5.02 4.91
b18 0.20 0.20 0.21 0.21 0.33 0.33 0.26 0.21 4.16 4.16
b19 0.28 0.28 0.28 0.28 0.21 0.21 0.29 0.26 4.86 4.86
b20 2.02 1.94 2.48 2.48 2.48 2.48 1.90 1.54 6.79 6.38
b21 2.08 2.08 1.70 1.70 2.23 2.23 1.87 1.40 5.91 5.71
b22 2.08 2.05 1.83 1.83 1.45 1.45 1.59 1.31 6.23 6.23
Table A.39: Results for critical path: d^Π_max(H), on target T1 for circuits in Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 0.62 0.61 1.66 1.62 3.91 3.91 0.49 0.03 13.06 13.06
mnist 9.02 5.97 5.97 4.54 10.83 9.77 4.97 2.45 17.86 17.86
mobilnet1 1.82 1.82 1.72 1.72 6.78 6.78 1.83 1.10 10.08 10.08
Table A.40: Results for critical path: d^Π_max(H), on target T2 for circuits in Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 0.84 0.73 2.08 2.08 3.92 3.92 0.80 0.36 20.92 20.92
mnist 9.02 5.97 5.97 4.54 17.49 16.24 5.16 2.73 30.71 25.95
mobilnet1 1.82 1.82 2.82 2.82 7.99 7.99 1.83 1.10 10.01 10.01
OneCore 1.25 1.25 1.72 1.70 4.25 4.09 0.74 0.01 17.67 14.69
PuLSAR 0.49 0.49 1.12 1.12 7.13 7.13 0.77 0.13 19.04 19.04
WasgaServer 0.44 0.44 2.26 2.26 5.74 5.74 0.56 0.01 23.91 23.91
bitonic_mesh 1.34 0.71 4.46 4.46 18.66 18.66 1.43 1.34 18.83 18.83
cholesky_bdti 4.67 2.82 2.00 2.00 6.68 6.68 2.49 1.90 19.89 19.89
dart 1.97 1.97 2.25 2.25 2.80 2.80 3.29 1.41 8.68 8.68
denoise 0.00 0.00 0.00 0.00 0.07 0.07 0.00 0.00 7.13 7.13
des90 2.94 2.15 1.47 1.47 13.20 13.20 1.62 1.34 18.97 18.97
xge_mac 2.50 2.50 5.12 5.12 9.90 9.81 4.28 3.98 20.59 14.59
cholesky_mc 2.11 2.11 2.00 2.00 5.65 4.88 2.87 1.90 21.73 21.73
Table A.41: Results for critical path: d^Π_max(H), on target T3 for circuits in Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 0.15 0.08 0.46 0.46 1.61 1.61 0.08 0.01 8.68 8.68
mnist 4.17 2.92 2.92 2.92 6.53 6.44 2.78 1.30 11.11 11.11
mobilnet1 1.82 1.82 1.72 1.72 3.11 3.11 1.83 1.10 6.34 6.34
Table A.42: Results for critical path: d^Π_max(H), on target T4 for circuits in Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 0.62 0.62 3.21 3.21 8.36 8.36 2.02 0.85 16.55 16.55
mnist 11.01 11.01 9.39 9.39 19.29 19.20 9.81 5.88 22.06 22.06
mobilnet1 5.35 5.35 4.95 4.95 12.57 12.57 4.19 1.83 13.05 13.05
OneCore 1.25 1.16 1.88 1.88 6.34 5.31 2.12 1.56 14.00 10.29
PuLSAR 2.92 2.92 2.20 2.20 8.68 8.68 2.23 1.07 28.98 28.98
WasgaServer 0.93 0.93 0.77 0.77 9.06 9.06 0.98 0.02 20.40 20.40
bitonic_mesh 2.10 2.10 2.94 2.94 24.21 24.21 2.21 2.10 16.60 16.60
cholesky_bdti 2.98 2.98 4.67 3.85 20.76 20.76 4.03 2.82 18.25 18.25
dart 3.93 3.93 4.21 4.21 5.89 5.89 4.71 3.37 11.50 11.50
denoise 0.00 0.00 0.02 0.02 1.80 1.80 0.00 0.00 5.55 5.55
des90 2.94 2.23 2.94 2.94 18.88 18.88 2.47 2.23 19.68 19.68
xge_mac 9.55 9.55 9.55 9.55 18.86 18.77 10.49 5.46 20.42 17.81
cholesky_mc 2.82 2.82 6.62 6.57 15.32 15.32 3.98 2.11 18.30 18.30
Table A.43: Results for critical path: d^Π_max(H), on target T5 for circuits in Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 0.99 0.99 3.83 3.83 8.57 8.57 2.33 0.85 29.35 29.35
mnist 12.63 12.63 12.54 12.54 28.81 25.48 14.24 9.30 33.76 28.91
mobilnet1 7.22 7.22 6.75 6.75 21.93 21.93 4.88 3.08 18.74 18.74
Table A.44: Results for critical path: d^Π_max(H), on target T6 for circuits in Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 0.01 0.01 1.12 1.12 2.67 2.67 0.20 0.02 9.74 9.74
mnist 4.54 4.54 4.54 4.45 9.39 9.30 5.35 4.54 11.01 9.77
mobilnet1 2.23 2.23 1.83 1.83 6.34 6.34 1.92 0.58 5.72 5.72
OneCore 0.41 0.32 0.29 0.29 2.39 2.39 0.70 0.23 11.04 8.55
PuLSAR 0.44 0.44 0.21 0.21 2.58 2.58 0.56 0.48 9.63 9.63
WasgaServer 0.01 0.01 0.01 0.01 4.32 4.32 0.08 0.01 7.24 7.24
bitonic_mesh 1.42 0.66 1.42 1.42 12.83 12.83 0.09 0.03 9.79 9.79
cholesky_bdti 0.97 0.97 2.00 2.00 7.82 7.82 0.92 0.21 8.96 8.96
dart 2.53 2.53 2.25 2.25 2.28 2.28 2.78 2.25 7.28 7.28
denoise 0.00 0.00 0.01 0.01 1.65 1.65 0.00 0.00 1.53 1.53
des90 1.42 1.42 1.42 1.42 8.18 8.18 0.24 0.03 9.61 9.61
xge_mac 4.41 4.32 4.41 4.32 6.85 6.85 5.18 3.98 8.59 7.11
cholesky_mc 1.08 1.08 1.13 1.08 6.08 6.08 1.02 0.97 7.11 7.11
Connectivity cost
The connectivity cost results presented in Chapter 6 are based on the tables introduced
in this subsection. Each table shows the effect of DKFM on the connectivity cost of
the partitions produced by the min-cut algorithms. These results were used to produce
the figures presenting the relative connectivity cost of each partition. The following
figures correspond to the complementary results of those presented in Chapter 6.
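For reference, the connectivity metric is commonly defined as f_λ(H^Π) = Σ_n w(n) · (λ(n) − 1), where λ(n) is the number of parts spanned by net n. Assuming that definition, a minimal sketch:

```python
def connectivity_cost(nets, part, weight=None):
    """(λ - 1) connectivity metric: each net pays w(n) * (parts spanned - 1).

    nets:   list of sets of vertices (one set per net)
    part:   dict mapping each vertex to its part index
    weight: optional list of net weights (defaults to 1 per net)
    """
    total = 0
    for i, net in enumerate(nets):
        lam = len({part[u] for u in net})   # λ(n): parts spanned by the net
        w = 1 if weight is None else weight[i]
        total += w * (lam - 1)
    return total
```

An uncut net (λ = 1) contributes nothing, so the metric coincides with the weighted cut for bipartitions and generalizes it for k-way partitions.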
[Plot legend: hmetis, hmetis+dkfm, khmetis, khmetis+dkfm, patoh, patoh+dkfm, kahypar, kahypar+dkfm, topopart.]
[Plot legend: hmetis, hmetis+dkfm, khmetis, khmetis+dkfm, patoh, patoh+dkfm, kahypar, kahypar+dkfm, topopart, topopart+dkfm.]
[Plot legend: hmetis, hmetis+dkfm, khmetis, khmetis+dkfm, patoh, patoh+dkfm, kahypar, kahypar+dkfm, topopart.]
Table A.45: Results for connectivity cost: f_λ(H^Π) × 10³, on target T1 for circuits in ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 15.78 17.56 19.56 20.78 20.04 20.04 14.15 15.26 30.78 32.78
b02 14.31 17.91 17.22 17.91 17.91 17.91 12.82 14.00 19.44 20.44
b03 18.54 18.54 29.96 29.96 22.10 30.34 18.09 19.21 58.73 58.73
b04 43.17 43.17 43.14 43.14 50.32 50.44 44.10 41.02 214.39 214.39
b05 33.43 33.43 37.96 37.96 39.38 39.38 41.72 34.53 329.00 334.41
b06 18.91 21.91 23.75 23.75 26.57 26.57 18.24 19.66 30.40 28.41
b07 40.74 50.76 50.28 50.28 52.29 52.29 45.16 40.14 133.72 133.72
b08 29.97 29.97 35.70 35.70 36.40 36.40 29.89 31.46 66.35 68.10
b09 24.32 24.32 23.43 32.47 29.83 29.83 24.72 26.04 79.37 84.19
b10 35.58 35.58 37.90 37.90 46.03 46.03 32.68 34.31 90.65 90.97
b11 45.56 66.14 58.16 64.93 60.17 82.88 52.72 58.00 228.19 228.19
b12 40.71 61.50 42.73 66.01 44.63 44.63 40.93 39.94 321.71 344.11
b13 2.51 7.44 5.76 5.76 6.26 6.26 7.00 7.26 76.95 83.05
b14 228.53 228.53 309.84 309.84 375.90 375.90 293.80 226.81 2709.14 2939.55
b17 153.01 245.69 187.88 187.88 236.40 236.40 227.89 155.97 7569.73 7775.84
b18 100.63 100.63 141.06 141.06 143.35 143.35 241.69 156.99 16153.86 16153.86
b19 375.41 375.41 286.12 286.12 145.29 145.29 380.08 306.03 33293.39 33293.39
b20 313.76 313.76 300.33 300.33 425.78 716.51 339.82 255.03 5434.42 5851.04
b21 296.62 296.62 297.89 297.89 391.46 764.62 284.08 235.43 5367.37 5530.10
b22 365.41 365.41 354.11 596.50 348.22 348.22 366.18 300.11 8258.14 8805.05
Table A.46: Results for connectivity cost: f_λ(H^Π) × 10³, on target T2 for circuits in ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 15.78 18.78 19.56 19.78 20.04 22.04 14.15 15.26 30.00 28.00
b02 14.31 14.31 17.22 18.22 17.91 17.91 12.82 15.00 18.83 17.83
b03 18.54 24.29 29.96 29.96 22.10 22.10 18.09 19.21 64.44 64.44
b04 43.17 43.17 43.14 73.38 50.32 67.18 44.10 41.02 175.29 181.65
b05 33.43 33.43 37.96 37.96 39.38 39.38 41.72 34.53 241.12 241.12
b06 18.91 21.83 23.75 25.66 26.57 26.57 18.24 19.35 32.62 30.71
b07 40.74 40.74 50.28 50.28 52.29 52.29 45.16 40.14 137.84 150.29
Table A.47: Results for connectivity cost: f_λ(H^Π) × 10³, on target T3 for circuits in ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 15.78 17.56 19.56 19.56 20.04 20.04 14.15 17.26 24.04 24.04
b02 14.31 14.31 17.22 17.22 17.91 17.91 12.82 14.00 18.75 20.75
b03 18.54 18.54 29.96 29.96 22.10 22.10 18.09 19.21 40.82 46.44
b04 43.17 43.17 43.14 43.14 50.32 54.90 44.10 41.02 120.42 120.42
b05 33.43 33.43 37.96 37.96 39.38 39.38 41.72 34.53 206.37 206.37
b06 18.91 21.91 23.75 23.75 26.57 26.57 18.24 19.35 25.96 27.96
b07 40.74 40.74 50.28 50.28 52.29 52.29 45.16 40.14 121.84 133.86
b08 29.97 29.97 35.70 35.70 36.40 42.51 29.89 31.46 55.51 55.51
b09 24.32 24.32 23.43 26.47 29.83 29.83 24.72 26.04 53.26 53.26
b10 35.58 35.58 37.90 37.90 46.03 46.03 32.68 34.31 76.66 76.66
b11 45.56 76.77 58.16 73.31 60.17 83.98 52.72 51.08 132.00 132.00
b12 40.71 63.25 42.73 65.34 44.63 44.63 40.93 39.94 194.36 212.00
b13 2.51 2.51 5.76 5.76 6.26 9.35 7.00 7.26 55.84 64.19
b14 228.53 228.53 309.84 309.84 375.90 375.90 293.80 226.81 2303.83 2334.47
b17 153.01 245.69 187.88 187.88 236.40 236.40 227.89 155.97 4440.85 4681.71
b18 100.63 100.63 141.06 141.06 143.35 143.35 241.69 156.99 6535.17 6535.17
b19 375.41 375.41 286.12 286.12 145.29 145.29 380.08 306.03 14537.49 14537.49
b20 313.76 313.76 300.33 300.33 425.78 822.86 339.82 255.03 3565.17 3854.48
b21 296.62 296.62 297.89 297.89 391.46 764.62 284.08 235.19 3598.38 3644.19
b22 365.41 365.41 354.11 354.11 348.22 348.22 366.18 301.02 5974.03 6141.00
Table A.48: Results for connectivity cost: f_λ(H^Π) × 10³, on target T4 for circuits in ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 25.52 25.52 34.26 34.26 34.04 36.04 25.93 28.04 39.78 39.78
b02 24.22 24.53 27.13 27.13 25.83 26.91 21.18 23.05 31.27 31.27
b03 34.58 34.58 52.04 58.61 45.01 46.60 33.86 36.52 94.02 94.02
b04 74.75 74.75 91.11 91.11 104.95 104.95 79.66 74.38 245.00 250.68
b05 63.77 63.77 73.49 108.50 74.20 102.72 75.66 63.82 230.59 230.59
b06 30.88 30.88 39.88 39.88 41.10 41.10 29.31 30.71 44.54 44.54
b07 65.31 65.31 89.41 89.41 92.45 113.09 67.89 62.58 155.88 165.26
Table A.49: Results for connectivity cost: f_λ(H^Π) × 10³, on target T5 for circuits in ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 25.52 28.04 34.26 37.26 34.04 34.04 25.93 28.04 37.78 37.78
b02 24.22 24.53 27.13 28.13 25.83 25.83 21.18 23.05 30.66 30.66
b03 34.58 34.58 52.04 56.28 45.01 45.01 33.86 36.52 85.37 86.07
b04 74.75 74.75 91.11 91.11 104.95 128.09 79.66 74.38 263.78 268.95
b05 63.77 63.77 73.49 108.04 74.20 74.20 75.66 63.82 214.16 245.65
b06 30.88 33.57 39.88 42.80 41.10 41.10 29.31 30.71 43.54 42.15
b07 65.31 65.31 89.41 89.41 92.45 108.17 67.89 62.58 164.29 166.24
b08 51.39 51.39 64.64 72.72 61.17 61.17 47.78 47.78 105.33 110.85
b09 44.57 50.90 52.86 54.04 51.76 54.76 39.74 44.18 96.54 101.37
b10 62.66 68.30 72.61 72.61 85.34 87.43 57.91 61.03 111.19 110.74
b11 73.52 82.04 99.79 117.18 92.82 104.53 81.99 77.55 207.91 213.74
b12 81.81 81.81 86.01 92.15 93.82 93.82 81.46 80.00 279.43 287.28
b13 9.19 9.19 14.95 14.95 17.45 17.45 15.43 13.61 77.21 77.21
b14 437.98 609.18 463.58 463.58 637.36 637.36 513.79 409.04 2198.62 2247.59
b17 457.66 457.66 563.98 563.98 538.76 538.76 522.56 391.49 5495.99 5715.91
b18 273.98 273.98 390.12 390.12 292.30 292.30 395.37 290.07 6015.34 6015.34
b19 523.60 523.60 479.48 479.48 376.70 376.70 605.31 485.28 8424.31 8836.47
b20 630.08 1021.52 653.96 653.96 811.17 811.17 607.30 472.31 4512.30 4545.84
b21 643.99 643.99 626.36 626.36 884.25 884.25 624.57 499.28 4597.18 4681.47
b22 779.01 779.01 803.10 803.10 983.40 983.40 693.72 537.02 5181.18 5279.24
Table A.50: Results for connectivity cost: f_λ(H^Π) × 10³, on target T6 for circuits in ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 25.52 28.78 34.26 35.26 34.04 34.04 25.93 28.04 45.78 43.78
b02 24.22 24.22 27.13 27.13 25.83 25.83 21.18 23.05 33.57 33.57
b03 34.58 34.58 52.04 54.98 45.01 48.57 33.86 36.52 73.38 73.38
b04 74.75 74.75 91.11 91.11 104.95 104.95 79.66 74.38 245.08 262.24
b05 63.77 63.77 73.49 108.04 74.20 74.20 75.66 63.82 250.08 273.08
b06 30.88 30.88 39.88 40.88 41.10 41.10 29.31 30.71 41.06 41.06
b07 65.31 65.31 89.41 89.41 92.45 115.07 67.89 62.58 142.33 158.63
Table A.51: Results for connectivity cost: f_λ(H^Π) × 10³, on target T1 for circuits in Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 339.37 427.52 546.69 1246.74 16334.62 16334.62 471.18 375.28 72204.32 72204.32
mnist 59.13 190.47 76.93 115.76 1558.20 1798.81 47.08 49.01 5266.51 5266.51
mobilnet1 124.96 124.96 134.81 134.81 21324.92 21324.92 174.56 171.18 71450.56 71450.56
OneCore 291.40 291.40 354.53 354.53 3662.96 4178.36 284.47 254.86 11224.67 11040.37
PuLSAR 550.20 550.20 673.69 673.69 24496.98 24496.98 738.65 451.50 69758.00 69758.00
WasgaServer 433.53 433.53 565.71 565.71 91152.84 91152.84 551.00 475.16 310204.31 310204.31
bitonic_mesh 211.71 211.71 231.62 231.62 71051.11 71051.11 180.63 184.54 92041.82 92041.82
cholesky_bdti 309.30 5311.50 289.50 289.50 24165.87 24165.87 318.99 318.45 68608.60 68608.60
dart 231.19 231.19 220.92 220.92 8449.84 8449.84 279.80 241.65 20604.75 20604.75
denoise 8.62 8.62 9.08 9.08 611.92 611.92 63.91 7.95 62171.28 62171.28
des90 172.75 370.85 215.51 215.51 33674.90 33674.90 155.23 158.30 52052.23 52052.23
xge_mac 84.21 84.21 92.80 92.80 647.48 647.48 83.70 89.10 1226.54 1204.79
cholesky_mc 170.30 170.30 203.21 247.03 12474.93 12474.93 244.88 208.44 31098.36 32204.75
Table A.52: Results for connectivity cost: f_λ(H^Π) × 10³, on target T2 for circuits in Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 339.37 630.49 546.69 546.69 16334.62 16334.62 471.18 399.64 69043.37 69043.37
mnist 59.13 61.07 76.93 115.76 1558.20 1943.67 47.08 49.01 4701.40 4848.31
mobilnet1 124.96 124.96 134.81 134.81 21324.92 21324.92 174.56 125.39 54896.87 54896.87
Table A.53: Results for connectivity cost: f_λ(H^Π) × 10³, on target T3 for circuits in Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 339.37 417.92 546.69 546.69 16334.62 16334.62 471.18 375.28 40149.34 40149.34
mnist 59.13 190.47 76.93 76.93 1558.20 1777.23 47.08 49.01 3929.59 3929.59
mobilnet1 124.96 124.96 134.81 134.81 21324.92 21324.92 174.56 171.18 57561.03 57561.03
OneCore 291.40 291.40 354.53 354.53 3662.96 3981.09 284.47 254.86 8019.80 8074.33
PuLSAR 550.20 550.20 673.69 673.69 24496.98 24496.98 738.65 451.50 50737.17 50737.17
WasgaServer 433.53 433.53 565.71 565.71 91152.84 91152.84 551.00 475.16 121510.73 121510.73
bitonic_mesh 211.71 211.71 231.62 231.62 71051.11 71051.11 180.63 184.54 43656.45 43656.45
cholesky_bdti 309.30 309.30 289.50 289.50 24165.87 24165.87 318.99 307.08 34461.86 34461.86
dart 231.19 231.19 220.92 220.92 8449.84 8449.84 279.80 241.65 10362.24 10589.51
denoise 8.62 8.62 9.08 9.08 611.92 611.92 63.91 7.95 2576.59 2576.59
des90 172.75 172.75 215.51 215.51 33674.90 33674.90 155.23 158.30 23958.57 23958.57
xge_mac 84.21 84.21 92.80 92.80 647.48 647.48 83.70 89.10 733.54 871.67
cholesky_mc 170.30 170.30 203.21 203.21 12474.93 12474.93 244.88 227.31 16160.26 16160.26
Table A.54: Results for connectivity cost: f_λ(H^Π) × 10³, on target T4 for circuits in Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 553.49 553.49 1066.99 1066.99 32142.63 32142.63 724.25 636.81 63339.58 63339.58
mnist 144.72 144.72 203.84 203.84 3370.12 3540.75 153.08 158.31 4850.16 4850.16
mobilnet1 261.15 261.15 353.53 353.53 34894.44 34894.44 402.84 364.09 80351.36 80351.36
Table A.55: Results for connectivity cost: f_λ(H^Π) × 10³, on target T5 for circuits in Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 553.49 553.49 1066.99 1066.99 32142.63 32142.63 724.25 636.81 72999.51 72999.51
mnist 144.72 144.72 203.84 203.84 3370.12 3557.89 153.08 158.31 5642.35 5621.69
mobilnet1 261.15 261.15 353.53 353.53 34894.44 34894.44 402.84 364.09 77503.64 77503.64
OneCore 697.72 893.57 884.14 884.14 6266.75 6386.59 712.94 571.48 12854.15 12812.20
PuLSAR 1506.55 1506.55 1813.54 1813.54 38721.46 38721.46 1670.17 1257.11 74412.45 74412.45
WasgaServer 752.61 752.61 907.82 907.82 163202.56 163202.56 773.57 671.19 263093.96 263093.96
bitonic_mesh 441.10 441.10 499.45 499.45 131346.60 131346.60 383.18 397.26 94376.97 94376.97
cholesky_bdti 667.28 792.57 672.46 734.12 60250.82 60250.82 731.40 733.16 71894.06 71894.06
dart 397.82 397.82 443.32 443.32 13440.29 13440.29 434.29 390.46 18974.90 19158.62
denoise 14.11 14.11 169.44 169.44 4536.17 4536.17 105.93 15.09 61577.92 61577.92
des90 379.46 379.46 423.98 423.98 53545.13 53383.46 357.38 370.87 45499.73 45499.73
xge_mac 222.31 235.04 297.86 432.27 1133.25 1161.22 220.02 228.51 1577.67 1647.32
cholesky_mc 291.53 291.53 394.86 885.98 20750.47 20750.47 321.48 323.35 29467.61 29467.61
Table A.56: Results for connectivity cost: f_λ(H^Π) × 10^3, on target T6 for circuits in Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 553.49 553.49 1066.99 1066.99 32142.63 32142.63 724.25 636.81 48723.13 48723.13
mnist 144.72 144.72 203.84 255.11 3370.12 3446.42 153.08 158.31 4748.70 4974.48
mobilnet1 261.15 261.15 353.53 353.53 34894.44 34894.44 402.84 364.09 66785.10 66785.10
Balance cost
In Chapter 6, we presented a comparison of the vertex weight balance of partitions. For this purpose, we use the results presented in the following tables. Each table shows the effect of DKFM on the balance cost of the partitions produced by the min-cut algorithms. The following figures correspond to results complementary to those presented in Chapter 6.
[Figures: balance cost of the partitions produced by hmetis, khmetis, patoh, kahypar, and topopart, before and after DKFM refinement; only the axis labels and legends survive extraction.]
Table A.57: Results for balance cost: β(H^Π), on target T1 for circuits in ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 4.92 4.92 1.00 1.00 1.00 1.00 1.00 1.00 2.96 2.96
b02 4.33 4.33 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
b03 1.00 2.23 1.00 2.23 1.00 2.23 1.62 1.62 2.23 2.23
b04 1.93 2.33 1.53 2.33 1.00 1.40 2.01 1.80 2.07 2.07
b05 4.30 4.30 1.00 1.10 1.19 1.19 1.79 1.10 1.97 2.16
b06 6.17 6.17 1.00 1.00 1.00 1.00 1.00 1.00 2.72 2.72
b07 6.18 6.18 1.68 1.23 1.23 1.23 2.04 1.68 2.35 2.35
Table A.58: Results for balance cost: β(H^Π), on target T2 for circuits in ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 4.92 4.92 1.00 2.96 1.00 1.00 1.00 1.00 2.96 2.96
b02 4.33 4.33 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
b03 1.00 2.23 1.00 2.23 1.00 1.00 1.62 2.23 2.23 2.23
b04 1.93 1.93 1.53 1.00 1.00 2.33 2.01 1.93 2.33 2.33
b05 4.30 4.30 1.00 1.10 1.19 1.19 1.79 1.10 2.26 2.26
b06 6.17 4.45 1.00 1.00 1.00 1.00 1.00 1.00 2.72 2.72
b07 6.18 6.18 1.68 2.35 1.23 1.23 2.04 1.68 2.35 2.35
b08 3.16 3.16 1.54 1.54 1.00 2.08 2.08 1.54 2.08 2.08
b09 5.65 4.49 1.58 2.74 1.00 1.00 2.04 1.58 2.74 2.74
b10 4.37 4.37 1.00 2.44 1.48 2.44 1.58 1.48 2.44 2.44
b11 5.09 3.94 1.38 2.28 1.26 2.28 1.97 1.51 2.28 2.15
b12 4.71 4.71 1.93 1.83 1.00 1.09 2.07 1.93 2.21 2.21
b13 4.54 4.27 1.82 1.54 1.00 1.00 1.51 1.27 2.36 2.36
b14 5.56 5.56 2.15 2.24 1.26 1.26 2.08 1.65 2.25 2.25
b17 4.13 4.13 2.14 2.09 1.49 1.49 2.22 1.95 2.25 2.25
b18 1.96 2.32 1.97 1.97 1.50 1.50 2.15 2.12 1.89 1.89
b19 1.98 2.42 1.16 1.16 1.26 1.26 2.19 1.83 2.25 2.25
b20 2.13 5.75 1.67 1.67 1.42 2.25 2.20 1.87 2.25 2.24
b21 1.68 5.35 1.13 1.13 1.31 2.25 1.98 1.41 2.25 2.00
b22 1.64 6.03 2.18 2.18 1.35 1.35 2.18 2.05 2.25 2.10
Table A.59: Results for balance cost: β(H^Π), on target T3 for circuits in ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 4.92 4.92 1.00 2.96 1.00 1.00 1.00 2.96 2.96 2.96
b02 4.33 4.33 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
b03 1.00 1.00 1.00 2.23 1.00 1.00 1.62 1.62 2.23 2.23
b04 1.93 1.93 1.53 1.00 1.00 2.33 2.01 2.07 2.20 2.20
b05 4.30 4.30 1.00 1.10 1.19 1.19 1.79 1.10 2.26 2.26
b06 6.17 4.45 1.00 1.00 1.00 1.00 1.00 1.00 2.72 2.72
b07 6.18 6.18 1.68 1.23 1.23 1.23 2.04 1.68 2.35 2.35
Table A.60: Results for balance cost: β(H^Π), on target T4 for circuits in ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 4.92 4.92 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
b02 4.33 4.33 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
b03 2.23 1.62 1.00 1.62 1.00 1.62 1.62 1.62 1.62 1.62
b04 2.86 2.73 1.27 1.67 1.00 1.00 1.50 1.40 1.67 1.67
b05 3.91 3.91 1.00 1.00 1.00 1.68 1.51 1.39 1.68 1.68
b06 4.45 4.45 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
b07 4.60 3.03 1.23 1.23 1.00 1.68 1.45 1.45 1.68 1.68
b08 3.16 3.16 1.00 1.54 1.00 1.00 1.38 1.54 1.54 1.54
b09 3.33 2.74 1.00 1.58 1.00 1.58 1.58 1.58 1.58 1.58
b10 3.88 3.88 1.48 1.96 1.00 1.00 1.48 1.48 1.96 1.96
b11 3.55 3.55 1.38 1.64 1.00 1.00 1.47 1.51 1.64 1.64
b12 3.13 2.76 1.56 1.65 1.09 1.09 1.54 1.37 1.65 1.65
b13 4.27 4.27 1.27 1.27 1.00 1.82 1.54 1.54 1.82 1.82
b14 4.08 3.48 1.37 1.62 1.04 1.62 1.61 1.56 1.62 1.62
b17 1.51 3.95 1.50 1.51 1.17 1.17 1.62 1.59 1.63 1.63
b18 1.49 2.89 1.48 1.48 1.25 1.25 1.61 1.58 1.63 1.63
b19 1.40 1.95 1.39 1.39 1.13 1.13 1.56 1.38 1.63 1.63
b20 1.58 3.78 1.58 1.58 1.17 1.63 1.59 1.50 1.63 1.63
b21 1.58 3.59 1.51 1.51 1.11 1.63 1.60 1.56 1.63 1.63
b22 1.57 3.23 1.59 1.63 1.16 1.16 1.59 1.45 1.63 1.63
Table A.61: Results for balance cost: β(H^Π), on target T5 for circuits in ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 4.92 4.92 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
b02 4.33 4.33 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
b03 2.23 1.62 1.00 1.62 1.00 1.00 1.62 1.62 1.62 1.62
b04 2.86 2.46 1.27 1.67 1.00 1.67 1.50 1.40 1.67 1.67
b05 3.91 2.84 1.00 1.00 1.00 1.00 1.51 1.39 1.68 1.68
b06 4.45 2.72 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
b07 4.60 3.48 1.23 1.23 1.00 1.68 1.45 1.45 1.68 1.68
Table A.62: Results for balance cost: β(H^Π), on target T6 for circuits in ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 4.92 4.92 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
b02 4.33 4.33 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
b03 2.23 1.62 1.00 1.00 1.00 1.62 1.62 1.62 1.62 1.62
b04 2.86 2.73 1.27 1.67 1.00 1.00 1.50 1.40 1.67 1.67
b05 3.91 2.94 1.00 1.00 1.00 1.00 1.51 1.39 1.68 1.68
b06 4.45 2.72 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
b07 4.60 3.48 1.23 1.23 1.00 1.68 1.45 1.45 1.68 1.68
b08 3.16 3.16 1.00 1.00 1.00 1.00 1.38 1.54 1.54 1.54
b09 3.33 3.33 1.00 1.00 1.00 1.00 1.58 1.00 1.58 1.58
b10 3.88 3.88 1.48 1.48 1.00 1.00 1.48 1.48 1.96 1.96
b11 3.55 3.55 1.38 1.38 1.00 1.51 1.47 1.51 1.64 1.64
b12 3.13 2.67 1.56 1.56 1.09 1.09 1.54 1.37 1.65 1.56
b13 4.27 4.27 1.27 1.27 1.00 1.82 1.54 1.54 1.82 1.82
b14 4.08 4.07 1.37 1.37 1.04 1.62 1.61 1.61 1.61 1.62
b17 1.51 1.63 1.50 1.63 1.17 1.17 1.62 1.59 1.63 1.63
b18 1.49 1.49 1.48 1.48 1.25 1.25 1.61 1.58 1.62 1.62
b19 1.40 1.40 1.39 1.39 1.13 1.13 1.56 1.38 1.62 1.62
b20 1.58 1.63 1.58 1.58 1.17 1.17 1.59 1.52 1.63 1.63
b21 1.58 1.58 1.51 1.51 1.11 1.11 1.60 1.56 1.62 1.62
b22 1.57 1.62 1.59 1.59 1.16 1.16 1.59 1.45 1.63 1.63
Table A.63: Results for balance cost: β(H^Π), on target T1 for circuits in Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 5.07 5.04 2.14 2.25 1.25 1.25 2.23 2.10 2.09 2.09
mnist 3.15 3.07 1.00 1.39 1.49 2.26 1.77 1.52 2.26 2.26
mobilnet1 3.66 3.66 1.24 1.24 1.25 1.25 2.19 1.73 2.25 2.25
Table A.64: Results for balance cost: β(H^Π), on target T2 for circuits in Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 5.07 4.88 2.14 2.14 1.25 1.25 2.23 2.10 2.25 2.25
mnist 3.15 3.15 1.00 1.39 1.49 2.25 1.77 1.52 2.26 2.26
mobilnet1 3.66 3.66 1.24 1.24 1.25 1.25 2.19 1.93 2.25 2.25
OneCore 5.98 5.98 2.11 2.25 1.00 1.93 2.19 1.27 2.25 2.25
PuLSAR 4.91 4.91 1.84 1.84 1.30 1.30 2.13 1.80 2.25 2.25
WasgaServer 5.62 5.62 2.18 2.18 1.50 1.50 2.06 1.65 2.25 2.25
bitonic_mesh 3.38 3.21 1.00 1.00 1.25 1.25 1.74 1.35 2.22 2.22
cholesky_bdti 4.01 3.93 1.22 1.22 1.00 1.00 1.91 1.54 2.25 2.25
dart 1.81 1.81 1.00 1.00 1.25 1.25 2.21 2.11 2.25 2.25
denoise 2.95 2.95 1.73 1.73 1.25 1.25 2.17 2.05 2.25 2.25
des90 3.50 3.48 1.00 1.00 1.00 1.00 1.89 1.29 2.25 2.25
xge_mac 5.07 5.07 1.52 1.52 1.43 2.23 2.25 2.25 2.25 2.23
cholesky_mc 4.58 4.58 1.19 1.19 1.25 1.81 2.12 1.99 2.07 2.07
Table A.65: Results for balance cost: β(H^Π), on target T3 for circuits in Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 5.07 5.04 2.14 2.14 1.25 1.25 2.23 2.11 1.45 1.45
mnist 3.15 3.07 1.00 1.00 1.49 2.26 1.77 1.70 2.26 2.26
mobilnet1 3.66 3.66 1.24 1.24 1.25 1.25 2.19 1.78 2.25 2.25
Table A.66: Results for balance cost: β(H^Π), on target T4 for circuits in Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 4.48 4.48 1.57 1.57 1.08 1.08 1.61 1.56 1.63 1.63
mnist 1.96 1.96 1.30 1.30 1.24 1.63 1.61 1.47 1.63 1.63
mobilnet1 3.69 3.69 1.29 1.29 1.16 1.16 1.59 1.55 1.62 1.62
OneCore 4.87 4.87 1.56 1.56 1.16 1.62 1.61 1.55 1.62 1.62
PuLSAR 5.04 5.04 1.45 1.45 1.17 1.17 1.62 1.58 1.63 1.63
WasgaServer 4.14 4.14 1.59 1.59 1.17 1.17 1.57 1.50 1.63 1.63
bitonic_mesh 2.18 2.18 1.00 1.00 1.13 1.13 1.49 1.25 1.62 1.62
cholesky_bdti 2.62 2.62 1.31 1.31 1.08 1.08 1.55 1.47 1.63 1.63
dart 2.63 2.63 1.00 1.00 1.08 1.08 1.62 1.60 1.63 1.63
denoise 2.70 2.70 1.50 1.50 1.17 1.17 1.61 1.58 1.63 1.63
des90 3.07 3.07 1.19 1.19 1.08 1.08 1.52 1.28 1.63 1.63
xge_mac 3.16 3.16 1.30 1.30 1.16 1.61 1.61 1.59 1.61 1.61
cholesky_mc 4.02 4.02 1.26 1.12 1.08 1.08 1.49 1.42 1.63 1.63
Table A.67: Results for balance cost: β(H^Π), on target T5 for circuits in Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 4.48 4.48 1.57 1.57 1.08 1.08 1.61 1.56 1.63 1.63
mnist 1.96 1.96 1.30 1.30 1.24 1.63 1.61 1.61 1.63 1.63
mobilnet1 3.69 3.69 1.29 1.29 1.16 1.16 1.59 1.48 1.62 1.62
Table A.68: Results for balance cost: β(H^Π), on target T6 for circuits in Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 4.48 4.48 1.57 1.57 1.08 1.08 1.61 1.56 1.61 1.61
mnist 1.96 1.96 1.30 1.63 1.24 1.30 1.61 1.56 1.63 1.63
mobilnet1 3.69 3.69 1.29 1.29 1.16 1.16 1.59 1.48 1.62 1.62
OneCore 4.87 4.87 1.56 1.56 1.16 1.16 1.61 1.59 1.62 1.62
PuLSAR 5.04 5.04 1.45 1.45 1.17 1.17 1.62 1.58 1.63 1.63
WasgaServer 4.14 4.14 1.59 1.59 1.17 1.17 1.57 1.50 1.62 1.62
bitonic_mesh 2.18 1.96 1.00 1.00 1.13 1.13 1.49 1.25 1.62 1.62
cholesky_bdti 2.62 2.62 1.31 1.31 1.08 1.08 1.55 1.40 1.60 1.60
dart 2.63 2.63 1.00 1.00 1.08 1.08 1.62 1.60 1.62 1.62
denoise 2.70 2.70 1.50 1.50 1.17 1.17 1.61 1.58 1.63 1.63
des90 3.07 3.07 1.19 1.19 1.08 1.08 1.52 1.28 1.62 1.62
xge_mac 3.16 3.16 1.30 1.61 1.16 1.16 1.61 1.59 1.61 1.61
cholesky_mc 4.02 4.02 1.26 1.16 1.08 1.08 1.49 1.36 1.61 1.61
Table A.69: Results of DKFMFAST effects on critical path: d^Π_max(H), on target T1 for circuits in ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 7.87 7.87 7.87 7.57 10.31 10.31 8.18 7.72 13.05 13.05
b02 12.31 12.31 12.31 12.13 9.09 9.09 15.18 15.18 12.31 12.31
b03 6.06 6.06 4.63 8.06 7.40 5.88 7.61 6.16 13.01 13.01
b04 4.03 4.03 3.35 3.53 3.35 3.35 3.36 2.37 11.42 11.42
b05 0.93 0.93 0.64 0.96 1.18 1.18 1.15 0.74 9.07 9.07
b06 11.96 11.96 12.31 12.13 12.31 12.31 14.98 12.31 15.36 15.36
b07 2.94 2.94 3.61 2.78 3.04 3.04 3.17 2.55 7.07 7.07
b08 4.98 4.98 5.22 5.22 5.83 5.83 5.11 4.98 7.29 7.29
b09 16.14 16.14 10.78 7.21 9.00 9.00 11.50 9.00 10.78 10.78
b10 8.23 8.23 6.78 6.78 8.23 8.23 7.76 6.86 9.59 9.59
b11 3.36 3.36 2.81 2.81 4.01 4.01 3.54 2.34 7.48 7.48
b12 4.16 4.16 3.24 3.76 3.24 3.24 3.98 3.24 10.10 10.10
b13 1.71 1.71 1.27 2.10 2.89 2.89 2.00 1.95 9.21 9.21
b14 2.12 2.12 2.11 2.42 2.39 2.39 2.17 1.56 11.35 11.35
b17 0.68 0.68 0.48 1.65 0.81 0.81 0.65 0.37 13.49 13.49
b18 0.10 0.21 0.21 0.21 0.11 0.11 0.21 0.21 8.15 8.15
b19 0.30 0.28 0.18 0.18 0.31 0.31 0.29 0.28 8.59 8.59
b20 4.01 1.52 2.17 2.17 2.24 2.24 1.98 1.03 11.73 11.73
b21 2.99 0.98 2.00 2.00 2.21 2.21 1.76 1.03 10.03 10.03
b22 1.63 1.22 1.78 1.78 0.88 0.88 1.52 0.47 12.27 12.26
Table A.70: Results of DKFMFAST effects on critical path: d^Π_max(H), on target T2 for circuits in ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 10.31 10.31 13.05 17.63 12.75 12.75 10.59 10.01 20.37 20.37
b02 15.18 15.18 18.05 15.36 18.41 12.31 17.01 15.18 24.33 24.33
b03 7.87 7.87 7.87 9.68 7.68 7.68 9.56 9.30 12.72 12.72
b04 4.84 4.84 4.07 3.99 5.76 5.76 4.46 2.65 9.61 9.61
b05 1.22 1.22 1.27 0.96 1.22 1.22 1.50 1.17 5.62 5.62
b06 21.46 21.46 15.18 12.13 15.18 15.18 19.22 18.23 27.20 27.20
b07 3.00 3.00 3.83 3.11 3.04 3.04 3.75 2.74 12.33 11.27
Table A.71: Results of DKFMFAST effects on critical path: d^Π_max(H), on target T3 for circuits in ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 5.28 5.28 7.87 7.42 7.72 7.72 7.42 7.42 10.31 10.31
b02 9.26 9.26 9.26 9.26 6.21 6.21 12.13 12.13 12.31 12.31
b03 4.54 4.54 4.63 6.44 4.45 4.45 6.97 6.16 11.39 11.39
b04 3.03 3.03 2.15 2.79 3.03 3.03 2.69 2.05 4.23 4.23
b05 0.93 0.93 0.64 0.64 1.18 1.18 1.06 0.74 5.91 5.82
b06 9.26 9.26 6.21 6.21 6.21 6.21 11.56 9.09 12.31 12.31
b07 2.40 2.40 2.52 2.21 2.74 2.74 2.44 2.01 5.85 5.34
b08 2.90 2.90 4.18 4.18 3.58 3.58 3.82 2.90 7.29 7.29
b09 10.78 10.78 9.00 5.42 7.21 7.21 9.17 9.00 10.78 10.78
b10 6.86 6.86 5.42 5.42 6.86 6.86 6.31 5.19 7.99 7.99
b11 1.76 1.76 1.93 1.93 3.01 3.01 2.21 1.84 3.18 3.18
b12 3.29 3.29 2.36 2.36 2.16 1.79 3.22 2.36 6.80 5.92
b13 0.88 0.88 1.27 2.10 2.05 2.05 2.00 1.95 3.37 3.37
b14 1.84 1.84 1.57 1.99 1.80 1.80 1.49 1.15 4.63 4.63
b17 0.68 0.68 0.48 0.75 0.81 0.81 0.65 0.37 5.56 5.56
b18 0.10 0.11 0.10 0.10 0.11 0.11 0.12 0.10 3.52 3.52
b19 0.20 0.18 0.18 0.18 0.31 0.31 0.21 0.20 4.02 4.01
b20 2.48 1.11 2.17 2.17 1.99 1.99 1.41 1.02 6.02 6.01
b21 1.74 0.98 2.00 2.00 1.96 1.96 1.53 1.03 5.65 5.42
b22 1.12 0.78 0.86 0.86 0.70 0.70 0.85 0.45 5.93 5.70
Table A.72: Results of DKFMFAST effects on critical path: d^Π_max(H), on target T4 for circuits in ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 15.49 15.49 15.49 15.64 17.93 17.93 20.86 15.49 30.74 30.74
b02 18.41 18.41 21.28 27.55 24.51 24.51 31.09 24.33 21.28 21.28
b03 12.54 12.54 21.10 14.53 21.10 21.10 17.81 11.01 20.91 20.91
b04 7.84 7.84 8.98 7.43 7.39 7.39 7.28 5.94 9.65 9.65
b05 3.67 3.67 2.88 3.48 3.69 3.69 4.23 2.80 5.58 5.58
b06 18.41 18.41 24.33 30.25 21.46 21.46 30.48 24.33 42.80 42.80
b07 7.46 7.46 7.10 8.16 7.87 6.79 7.66 6.01 11.33 11.33
Table A.73: Results of DKFMFAST effects on critical path: d^Π_max(H), on target T5 for circuits in ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 20.52 20.52 23.11 20.67 20.52 20.52 25.61 20.67 38.81 38.81
b02 18.41 18.41 30.60 24.33 27.55 27.55 32.68 24.51 30.43 30.43
b03 15.77 15.77 22.72 11.20 22.72 22.72 16.57 14.53 32.24 30.71
b04 12.06 12.06 9.61 10.04 9.30 9.30 9.13 7.32 9.44 9.30
b05 4.34 4.34 3.38 4.74 3.69 3.69 4.90 3.43 8.42 8.42
b06 36.70 36.70 30.43 27.38 21.46 21.46 35.34 24.51 21.28 21.28
b07 6.94 6.94 10.88 10.34 8.93 8.93 9.25 7.42 14.99 14.99
b08 15.57 15.57 16.48 19.59 15.45 15.45 19.50 15.45 19.71 19.71
b09 26.85 26.85 19.71 21.50 30.43 30.43 22.57 17.51 25.07 25.07
b10 21.69 21.69 19.05 14.80 17.76 17.76 21.09 17.68 16.24 16.16
b11 6.73 6.73 8.78 6.79 6.20 6.20 7.83 5.91 12.95 12.95
b12 12.01 12.01 11.18 12.84 17.12 17.12 10.73 9.07 10.15 10.15
b13 2.93 2.93 5.15 5.05 6.87 6.87 7.21 4.60 6.66 6.66
b14 5.69 5.69 6.45 5.00 7.69 7.69 6.68 4.81 9.31 9.31
b17 1.51 2.29 2.60 2.74 4.21 4.21 2.79 1.54 7.07 7.07
b18 0.38 0.42 0.98 0.98 1.32 1.32 0.66 0.42 5.52 5.11
b19 0.89 1.00 0.96 0.96 0.68 0.68 0.81 0.69 4.04 4.04
b20 5.59 5.01 5.98 5.98 6.25 6.25 5.45 4.24 8.45 8.45
b21 4.27 5.00 4.46 4.46 5.71 5.71 4.63 3.59 11.11 11.11
b22 4.49 4.36 5.05 5.05 4.72 4.72 4.79 3.47 7.34 7.34
Table A.74: Results of DKFMFAST effects on critical path: d^Π_max(H), on target T6 for circuits in ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 7.87 7.87 7.87 7.87 7.72 7.72 9.93 7.72 15.64 15.64
b02 9.26 9.26 12.31 12.31 9.26 9.26 13.17 12.13 15.36 15.36
b03 4.92 4.92 8.15 8.15 8.15 8.15 8.10 6.53 11.29 11.29
b04 4.23 4.23 4.23 4.23 3.35 3.35 3.58 2.72 8.98 8.94
b05 1.81 1.81 1.80 1.80 1.83 1.83 1.82 1.22 5.77 5.47
b06 12.31 12.31 9.26 9.26 9.26 9.26 13.76 12.13 12.13 12.13
b07 4.19 4.19 4.89 4.89 4.06 4.06 4.06 3.29 7.65 7.65
Table A.75: Results of DKFMFAST effects on critical path: d^Π_max(H), on target T1 for circuits in Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 0.62 0.62 1.66 1.66 3.91 3.91 0.49 0.03 13.06 12.92
mnist 9.02 5.69 5.97 5.97 10.83 10.83 4.97 2.45 17.86 17.86
mobilnet1 1.82 1.82 1.72 1.72 6.78 6.78 1.83 1.10 10.08 10.08
OneCore 1.25 1.25 0.11 0.11 2.40 2.40 0.47 0.01 10.37 9.95
PuLSAR 0.49 0.49 0.09 0.09 3.82 3.82 0.39 0.08 11.30 11.30
WasgaServer 0.03 0.03 0.67 0.67 4.92 4.92 0.38 0.01 13.19 13.19
bitonic_mesh 0.71 0.71 1.42 1.42 12.70 12.70 0.71 0.71 12.07 12.07
cholesky_bdti 1.90 1.90 1.90 1.90 4.83 4.83 2.10 1.13 10.96 10.96
dart 1.97 1.97 1.69 1.69 2.79 2.79 2.75 1.41 6.71 6.71
denoise 0.00 0.00 0.00 0.00 0.07 0.07 0.00 0.00 6.85 6.85
des90 2.94 2.94 1.42 1.42 11.27 11.27 0.95 0.58 12.21 12.21
xge_mac 2.50 2.50 3.98 3.98 6.77 6.77 4.28 3.98 8.67 8.59
cholesky_mc 1.08 1.08 2.00 2.00 4.88 4.88 1.67 0.97 12.87 12.87
Table A.76: Results of DKFMFAST effects on critical path: d^Π_max(H), on target T2 for circuits in Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 0.84 0.84 2.08 2.08 3.92 3.92 0.80 0.36 20.92 20.92
mnist 9.02 8.93 5.97 5.97 17.49 17.49 5.16 2.73 30.71 30.71
mobilnet1 1.82 1.82 2.82 2.82 7.99 7.99 1.83 1.10 10.01 10.01
Table A.77: Results of DKFMFAST effects on critical path: d^Π_max(H), on target T3 for circuits in Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 0.15 0.15 0.46 0.46 1.61 1.61 0.08 0.01 8.68 8.27
mnist 4.17 4.17 2.92 2.92 6.53 6.53 2.78 2.45 11.11 11.11
mobilnet1 1.82 1.82 1.72 1.72 3.11 3.11 1.83 1.10 6.34 6.34
OneCore 0.32 0.32 0.11 0.11 1.96 1.96 0.17 0.01 6.60 6.60
PuLSAR 0.25 0.25 0.08 0.08 2.61 2.61 0.10 0.01 7.78 7.78
WasgaServer 0.01 0.01 0.47 0.47 2.56 2.56 0.08 0.01 7.82 7.82
bitonic_mesh 0.03 0.03 1.42 1.42 8.14 8.14 0.03 0.03 9.03 9.03
cholesky_bdti 0.97 0.97 0.97 0.97 3.25 3.25 0.90 0.21 6.18 6.18
dart 1.41 1.41 1.69 1.69 1.40 1.40 2.14 1.13 6.72 6.72
denoise 0.00 0.00 0.00 0.00 0.07 0.07 0.00 0.00 1.36 1.36
des90 1.42 1.42 1.42 1.42 7.20 7.20 0.18 0.03 8.22 8.22
xge_mac 2.50 2.50 3.98 3.98 5.72 5.72 4.28 3.98 7.02 7.02
cholesky_mc 1.08 1.08 1.13 1.13 3.09 3.09 0.92 0.26 7.98 7.98
Table A.78: Results of DKFMFAST effects on critical path: d^Π_max(H), on target T4 for circuits in Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 0.62 0.62 3.21 3.21 8.36 8.36 2.02 0.85 16.55 16.13
mnist 11.01 11.01 9.39 9.39 19.29 19.29 9.81 5.88 22.06 22.06
mobilnet1 5.35 5.35 4.95 4.95 12.57 12.57 4.19 1.83 13.05 13.05
Table A.79: Results of DKFMFAST effects on critical path: d^Π_max(H), on target T5 for circuits in Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 0.99 0.99 3.83 3.83 8.57 8.57 2.33 0.85 29.35 29.35
mnist 12.63 12.63 12.54 12.54 28.81 28.81 14.24 9.30 33.76 33.76
mobilnet1 7.22 7.22 6.75 6.75 21.93 21.93 4.88 3.08 18.74 18.11
OneCore 2.18 2.18 2.24 2.24 5.39 5.39 2.76 1.54 10.37 9.46
PuLSAR 2.89 2.66 1.40 1.40 7.91 7.91 2.71 1.90 18.27 18.27
WasgaServer 0.84 0.84 1.06 1.06 14.21 14.21 1.18 0.43 24.87 24.87
bitonic_mesh 2.94 2.94 4.46 4.46 39.47 39.47 2.35 2.10 31.01 31.01
cholesky_bdti 4.67 4.67 4.67 4.67 19.78 19.78 4.22 2.82 16.46 16.46
dart 5.60 5.60 5.04 5.04 6.29 6.29 5.80 4.21 12.31 12.31
denoise 0.00 0.00 0.03 0.03 3.21 3.21 0.00 0.00 12.17 12.08
des90 3.67 3.67 4.46 4.46 19.24 19.24 2.34 2.23 18.92 18.92
xge_mac 13.28 13.20 13.20 13.20 21.64 21.64 14.45 8.42 29.64 29.64
cholesky_mc 4.78 4.78 6.62 6.62 17.17 17.17 4.52 2.82 16.46 16.46
Table A.80: Results of DKFMFAST effects on critical path: d^Π_max(H), on target T6 for circuits in Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 0.01 0.01 1.12 1.12 2.67 2.67 0.20 0.02 9.74 9.74
mnist 4.54 4.54 4.54 4.54 9.39 9.39 5.35 4.54 11.01 9.77
mobilnet1 2.23 2.23 1.83 1.83 6.34 6.34 1.92 0.58 5.72 5.72
Connectivity cost
The results on connectivity cost presented in Chapter 6 are based on the tables introduced in this subsection. Each table shows the effect of DKFMFAST on the connectivity cost of the partitions produced by the min-cut algorithms. These results were used to produce the figures presenting the relative connectivity cost of each partition. The following figures correspond to results complementary to those presented in Chapter 6.
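The connectivity metric f_λ reported in these tables is the standard (λ − 1) cut metric of hypergraph partitioning. A minimal sketch of its computation is given below; the names and data layout are illustrative, not those of any particular partitioner.

```python
def connectivity_cost(hyperedges, edge_weights, part):
    """(lambda - 1) connectivity metric.

    Each hyperedge e contributes w(e) * (lambda(e) - 1), where
    lambda(e) is the number of distinct parts spanned by its pins;
    an uncut hyperedge therefore contributes nothing.
    """
    total = 0
    for pins, w in zip(hyperedges, edge_weights):
        spanned = len({part[v] for v in pins})
        total += w * (spanned - 1)
    return total
```

A hyperedge of weight 2 whose pins fall into three different parts contributes 2 × (3 − 1) = 4 to the total, while any hyperedge entirely inside one part contributes 0.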
[Figures: relative connectivity cost of the partitions produced by hmetis, khmetis, patoh, kahypar, and topopart, before and after DKFMFAST refinement; only the axis labels and legends survive extraction.]
Table A.81: Results of DKFMFAST effects on connectivity cost: f_λ(H^Π) × 10^3, on target T1 for circuits in ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 15 15 19 19 20 20 14 15 30 30
b02 14 14 17 17 17 17 12 14 19 19
b03 18 18 29 29 22 24 18 19 58 58
b04 43 43 43 43 50 50 44 41 214 214
b05 33 33 37 37 39 39 41 34 328 328
b06 18 18 23 23 26 26 18 19 30 30
b07 40 40 50 50 52 52 45 40 133 133
Table A.82: Results of DKFMFAST effects on connectivity cost: f_λ(H^Π) × 10^3, on target T2 for circuits in ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 15 15 19 19 20 20 14 15 30 30
b02 14 14 17 17 17 16 12 14 18 18
b03 18 18 29 29 22 22 18 19 64 64
b04 43 43 43 43 50 50 44 41 175 175
b05 33 33 37 37 39 39 41 34 241 241
b06 18 18 23 23 26 26 18 19 32 32
b07 40 40 50 50 52 52 45 40 137 146
b08 29 29 35 35 36 36 29 31 73 73
b09 24 24 23 23 29 29 24 26 78 78
b10 35 35 37 37 46 46 32 34 90 90
b11 45 45 58 58 60 60 52 47 190 190
b12 40 40 42 42 44 44 40 39 286 286
b13 2 2 5 5 6 6 6 7 65 65
b14 228 228 309 309 375 375 293 226 2243 2243
b17 153 153 187 187 236 236 227 155 5671 6149
b18 100 100 141 141 143 143 241 156 8231 8231
b19 375 375 286 286 145 145 380 306 20779 20779
b20 313 313 300 300 425 425 339 255 4861 4861
b21 296 296 297 297 391 391 284 235 4021 4422
b22 365 365 354 354 348 348 366 299 6264 6264
Table A.83: Results of DKFMFAST effects on connectivity cost: f_λ(H^Π) × 10^3, on target T3 for circuits in ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 15 15 19 19 20 20 14 15 24 24
b02 14 14 17 17 17 17 12 14 18 18
b03 18 18 29 29 22 22 18 19 40 40
b04 43 43 43 43 50 50 44 41 120 120
b05 33 33 37 37 39 39 41 34 206 226
b06 18 18 23 23 26 26 18 19 25 25
b07 40 40 50 50 52 52 45 40 121 125
Table A.84: Results of DKFMFAST effects on connectivity cost: f_λ(H^Π) × 10^3, on target T4 for circuits in ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 25 25 34 34 34 34 25 28 39 39
b02 24 24 27 27 25 25 21 23 31 31
b03 34 34 52 52 45 45 33 36 94 94
b04 74 74 91 91 104 104 79 74 244 244
b05 63 63 73 73 74 74 75 63 230 230
b06 30 30 39 39 41 41 29 30 44 44
b07 65 65 89 89 92 97 67 62 155 155
b08 51 51 64 64 61 61 47 47 106 106
b09 44 48 52 52 51 53 39 42 101 101
b10 62 62 72 72 85 85 57 59 118 118
b11 73 82 99 99 92 92 81 72 216 216
b12 81 81 86 86 93 93 81 80 307 317
b13 9 9 14 14 17 17 15 13 69 69
b14 437 437 463 463 637 637 513 380 2330 2330
b17 457 457 563 563 538 538 522 382 5497 5863
b18 273 273 390 390 292 292 395 290 5305 5305
b19 523 523 479 479 376 376 605 485 12863 14835
b20 630 630 653 653 811 811 607 472 3908 3908
b21 643 643 626 626 884 884 624 499 4752 5038
b22 779 779 803 803 983 983 693 537 6691 6691
Table A.85: Results of DKFMFAST effects on connectivity cost: f_λ(H^Π) × 10^3, on target T5 for circuits in ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 25 25 34 34 34 34 25 28 37 37
b02 24 24 27 27 25 25 21 23 30 30
b03 34 34 52 52 45 45 33 36 85 91
b04 74 74 91 91 104 104 79 74 263 260
b05 63 63 73 73 74 74 75 63 214 214
b06 30 30 39 39 41 41 29 30 43 43
b07 65 65 89 89 92 92 67 62 164 164
Table A.86: Results of DKFMFAST effects on connectivity cost: f_λ(H^Π) × 10^3, on target T6 for circuits in ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 25 25 34 34 34 34 25 28 45 45
b02 24 24 27 27 25 25 21 23 33 33
b03 34 34 52 52 45 45 33 36 73 73
b04 74 74 91 91 104 104 79 74 245 255
b05 63 63 73 73 74 74 75 63 250 275
b06 30 30 39 39 41 41 29 30 41 41
b07 65 65 89 89 92 92 67 62 142 142
b08 51 51 64 64 61 61 47 47 83 83
b09 44 44 52 52 51 51 39 42 86 86
b10 62 62 72 72 85 85 57 59 99 99
b11 73 76 99 99 92 92 81 72 225 225
b12 81 81 86 86 93 93 81 80 290 300
b13 9 9 14 14 17 17 15 13 87 87
b14 437 437 463 463 637 637 513 380 2363 2363
b17 457 457 563 563 538 538 522 382 5729 5867
b18 273 273 390 390 292 292 395 290 8302 8302
b19 523 523 479 479 376 376 605 485 16437 16437
b20 630 630 653 653 811 811 607 472 4740 4949
b21 643 643 626 626 884 884 624 499 4347 4347
b22 779 779 803 803 983 983 693 537 6241 6241
Table A.87: Results of DKFMFAST effects on connectivity cost: f_λ(H^Π) × 10^3, on target T1 for circuits in Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 339 339 546 546 16334 16334 471 375 72204 72204
mnist 59 78 76 76 1558 1558 47 49 5266 5266
mobilnet1 124 124 134 134 21324 21324 174 125 71450 72379
Table A.88: Results of DKFMFAST effects on connectivity cost: f_λ(H^Π) × 10^3, on target T2 for circuits in Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 339 339 546 546 16334 16334 471 375 69043 69043
mnist 59 78 76 76 1558 1558 47 49 4701 4701
mobilnet1 124 124 134 134 21324 21324 174 125 54896 54896
OneCore 291 291 354 354 3662 3961 284 254 11244 11670
PuLSAR 550 550 673 3307 24496 24496 738 451 58753 59050
WasgaServer 433 433 565 565 91152 91152 550 475 273364 273364
bitonic_mesh 211 211 231 231 71051 71051 180 184 90752 90752
cholesky_bdti 309 309 289 289 24165 24165 318 307 58323 58323
dart 231 231 220 220 8449 8449 279 241 11980 11980
denoise 8 8 9 9 611 611 63 7 55994 57439
des90 172 172 215 215 33674 33674 155 158 48460 48460
xge_mac 84 84 92 92 647 647 83 89 1433 1433
cholesky_mc 170 170 203 203 12474 12474 244 208 20981 20981
Table A.89: Results of DKFMFAST effects on connectivity cost: f_λ(H^Π) × 10^3, on target T3 for circuits in Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 339 339 546 546 16334 16334 471 375 40149 41016
mnist 59 59 76 76 1558 1558 47 49 3929 3929
mobilnet1 124 124 134 134 21324 21324 174 125 57561 57561
Table A.90: Results of DKFMFAST effects on connectivity cost: f_λ(H^Π) × 10^3, on target T4 for circuits in Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 553 553 1066 1066 32142 32142 724 636 63339 63339
mnist 144 144 203 203 3370 3370 153 157 4850 4850
mobilnet1 261 261 353 353 34894 34894 402 364 80351 80351
OneCore 697 697 884 884 6266 6266 712 571 12495 13083
PuLSAR 1506 1506 1813 1813 38721 38721 1670 1257 86575 86893
WasgaServer 752 752 907 907 163202 163202 773 671 254302 254302
bitonic_mesh 441 441 499 499 131346 133836 383 397 100498 100498
cholesky_bdti 667 667 672 672 60250 60250 731 733 69748 69748
dart 397 397 443 443 13440 13440 434 390 19873 20834
denoise 14 14 169 169 4536 4536 105 15 34050 34324
des90 379 407 423 423 53545 53545 357 370 50801 50801
xge_mac 222 222 297 297 1133 1133 220 228 1414 1414
cholesky_mc 291 291 394 394 20750 20750 321 305 28614 28614
Table A.91: Results of DKFMFAST effects on connectivity cost: f_λ(H^Π) × 10^3, on target T5 for circuits in Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 553 553 1066 1066 32142 32142 724 636 72999 73624
mnist 144 144 203 203 3370 3370 153 157 5642 5642
mobilnet1 261 261 353 353 34894 34894 402 364 77503 78250
Table A.92: Results of DKFMFAST effects on connectivity cost: fλ(H^Π) × 10³, on target T6 for circuits in the Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 553 553 1066 1066 32142 32142 724 636 48723 48723
mnist 144 144 203 203 3370 3370 153 157 4748 4728
mobilnet1 261 261 353 353 34894 34894 402 364 66785 66785
OneCore 697 697 884 884 6266 6266 712 571 13627 13627
PuLSAR 1506 1506 1813 1813 38721 38721 1670 1257 72032 72032
WasgaServer 752 752 907 907 163202 163202 773 671 147895 147895
bitonic_mesh 441 441 499 499 131346 131346 383 397 46519 46519
cholesky_bdti 667 667 672 1176 60250 60250 731 723 47576 47576
dart 397 397 443 443 13440 13440 434 390 11958 11958
denoise 14 14 169 169 4536 4536 105 15 3186 3186
des90 379 379 423 423 53545 53545 357 370 40545 40545
xge_mac 222 342 297 297 1133 1133 220 228 1125 1125
cholesky_mc 291 291 394 394 20750 20750 321 305 19019 19019
Balance cost
In Chapter 6, we presented a comparison of the vertex weight balance of partitions. For this purpose, we rely on the results reported in the following tables: each table shows the effect of DKFMFAST on the balance cost of the partitions produced by the min-cut algorithms. The following figures complement the results presented in Chapter 6.
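The two metrics tabulated throughout this appendix, the connectivity cost fλ(H^Π) and the balance cost β(H^Π), are defined in Chapter 1. The sketch below is only an illustration: it assumes the standard (λ - 1) connectivity metric and a heaviest-part over average-part weight ratio for balance, which may differ from the exact definitions used in this thesis.

```python
def connectivity_cost(hyperedges, part):
    """(lambda - 1) metric: for each hyperedge, count the number of
    distinct parts it spans, minus one, and sum over all hyperedges.
    An uncut hyperedge contributes zero."""
    total = 0
    for edge in hyperedges:
        spanned = {part[v] for v in edge}
        total += len(spanned) - 1
    return total

def balance_cost(weights, part, k):
    """Illustrative balance cost: heaviest part weight divided by the
    ideal (average) part weight; 1.0 means a perfectly balanced partition."""
    loads = [0.0] * k
    for v, w in weights.items():
        loads[part[v]] += w
    ideal = sum(weights.values()) / k
    return max(loads) / ideal

# Tiny example: 4 unit-weight vertices split into 2 parts.
part = {0: 0, 1: 0, 2: 1, 3: 1}
weights = {0: 1, 1: 1, 2: 1, 3: 1}
hyperedges = [[0, 1], [1, 2, 3], [0, 3]]
print(connectivity_cost(hyperedges, part))  # 2: two hyperedges span both parts
print(balance_cost(weights, part, 2))       # 1.0: perfectly balanced
```

Under these assumed definitions, a balance value of 1.00, as reported for many small ITC instances below, corresponds to a perfectly balanced partition.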
[Figure: balance cost (×100) of the partitions produced by hmetis, khmetis, patoh, kahypar, and topopart, each with and without DKFMFAST refinement.]
[Figure: balance cost (×100) of the partitions produced by hmetis, khmetis, patoh, kahypar, and topopart, each with and without DKFMFAST refinement.]
[Figure: balance cost (×100) of the partitions produced by hmetis, khmetis, patoh, kahypar, and topopart, each with and without DKFMFAST refinement.]
[Figure: balance cost (×100) of the partitions produced by hmetis, khmetis, patoh, kahypar, and topopart, each with and without DKFMFAST refinement.]
Table A.93: Results of DKFMFAST effects on balance cost: β(H^Π), on target T1 for circuits in the ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 4.92 4.92 1.00 1.00 1.00 1.00 1.00 1.00 2.96 2.96
b02 4.33 4.33 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
b03 1.00 1.00 1.00 1.00 1.00 1.00 1.62 1.62 2.23 2.23
b04 1.93 1.93 1.53 1.00 1.00 1.00 2.01 1.13 2.07 2.07
b05 4.30 4.30 1.00 1.10 1.19 1.19 1.79 1.10 1.97 1.97
b06 6.17 6.17 1.00 1.00 1.00 1.00 1.00 1.00 2.72 2.72
b07 6.18 6.18 1.68 1.23 1.23 1.23 2.04 1.68 2.35 2.35
b08 3.16 3.16 1.54 1.54 1.00 1.00 2.08 2.08 2.08 2.08
b09 5.65 5.65 1.58 1.58 1.00 1.00 2.04 1.58 2.74 2.74
b10 4.37 4.37 1.00 1.48 1.48 1.48 1.58 1.48 1.96 1.96
b11 5.09 5.09 1.38 1.00 1.26 1.26 1.97 1.51 2.28 2.28
b12 4.71 4.71 1.93 1.83 1.00 1.00 2.07 1.93 2.11 2.11
b13 4.54 4.54 1.82 1.54 1.00 1.00 1.51 1.27 2.36 2.36
b14 5.56 5.56 2.15 2.14 1.26 1.26 2.08 1.65 2.23 2.23
b17 4.13 4.13 2.14 2.09 1.49 1.49 2.22 2.06 1.82 1.82
b18 1.96 2.32 1.97 1.97 1.50 1.50 2.15 2.12 2.20 2.20
b19 1.98 2.42 1.16 1.16 1.26 1.26 2.19 1.83 1.11 1.11
b20 2.13 5.75 1.67 1.67 1.42 1.42 2.20 2.12 2.18 2.18
b21 1.68 5.35 1.13 1.13 1.31 1.31 1.98 1.41 2.23 2.23
b22 1.64 6.03 2.18 2.18 1.35 1.35 2.18 2.05 1.94 2.25
Table A.94: Results of DKFMFAST effects on balance cost: β(H^Π), on target T2 for circuits in the ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 4.92 4.92 1.00 1.00 1.00 1.00 1.00 1.00 2.96 2.96
b02 4.33 4.33 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
b03 1.00 1.00 1.00 1.00 1.00 1.00 1.62 1.62 2.23 2.23
b04 1.93 1.93 1.53 1.00 1.00 1.00 2.01 1.13 2.33 2.33
b05 4.30 4.30 1.00 1.10 1.19 1.19 1.79 1.10 2.26 2.26
b06 6.17 6.17 1.00 1.00 1.00 1.00 1.00 1.00 2.72 2.72
b07 6.18 6.18 1.68 1.23 1.23 1.23 2.04 1.68 2.35 2.35
Table A.95: Results of DKFMFAST effects on balance cost: β(H^Π), on target T3 for circuits in the ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 4.92 4.92 1.00 1.00 1.00 1.00 1.00 1.00 2.96 2.96
b02 4.33 4.33 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
b03 1.00 1.00 1.00 1.00 1.00 1.00 1.62 1.62 2.23 2.23
b04 1.93 1.93 1.53 1.00 1.00 1.00 2.01 1.13 2.20 2.20
b05 4.30 4.30 1.00 1.10 1.19 1.19 1.79 1.10 2.26 2.16
b06 6.17 6.17 1.00 1.00 1.00 1.00 1.00 1.00 2.72 2.72
b07 6.18 6.18 1.68 1.23 1.23 1.23 2.04 1.68 2.35 2.35
b08 3.16 3.16 1.54 1.54 1.00 1.00 2.08 2.08 1.54 1.54
b09 5.65 5.65 1.58 1.58 1.00 1.00 2.04 1.58 2.74 2.74
b10 4.37 4.37 1.00 1.48 1.48 1.48 1.58 1.48 2.44 2.44
b11 5.09 5.09 1.38 1.00 1.26 1.26 1.97 1.51 2.28 2.28
b12 4.71 4.71 1.93 1.83 1.00 1.19 2.07 1.93 2.21 1.83
b13 4.54 4.54 1.82 1.54 1.00 1.00 1.51 1.27 2.09 2.09
b14 5.56 5.56 2.15 2.14 1.26 1.26 2.08 1.65 2.25 2.25
b17 4.13 4.13 2.14 2.09 1.49 1.49 2.22 2.06 2.25 2.25
b18 1.96 2.32 1.97 1.97 1.50 1.50 2.15 2.12 2.25 2.25
b19 1.98 2.42 1.16 1.16 1.26 1.26 2.19 1.83 2.13 2.25
b20 2.13 5.75 1.67 1.67 1.42 1.42 2.20 2.12 2.22 2.25
b21 1.68 5.35 1.13 1.13 1.31 1.31 1.98 1.41 2.25 2.25
b22 1.64 6.03 2.18 2.18 1.35 1.35 2.18 1.97 2.25 2.25
Table A.96: Results of DKFMFAST effects on balance cost: β(H^Π), on target T4 for circuits in the ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 4.92 4.92 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
b02 4.33 4.33 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
b03 2.23 2.23 1.00 1.62 1.00 1.00 1.62 1.62 1.62 1.62
b04 2.86 2.86 1.27 1.27 1.00 1.00 1.50 1.40 1.67 1.67
b05 3.91 3.91 1.00 1.00 1.00 1.00 1.51 1.39 1.68 1.68
b06 4.45 4.45 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
b07 4.60 4.60 1.23 1.23 1.00 1.23 1.45 1.45 1.68 1.68
Table A.97: Results of DKFMFAST effects on balance cost: β(H^Π), on target T5 for circuits in the ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 4.92 4.92 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
b02 4.33 4.33 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
b03 2.23 2.23 1.00 1.00 1.00 1.00 1.62 1.62 1.62 1.62
b04 2.86 2.86 1.27 1.27 1.00 1.00 1.50 1.40 1.67 1.67
b05 3.91 3.91 1.00 1.00 1.00 1.00 1.51 1.29 1.68 1.68
b06 4.45 4.45 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
b07 4.60 4.60 1.23 1.23 1.00 1.00 1.45 1.45 1.68 1.68
b08 3.16 3.16 1.00 1.00 1.00 1.00 1.38 1.00 1.54 1.54
b09 3.33 3.33 1.00 1.00 1.00 1.00 1.58 1.58 1.58 1.58
b10 3.88 3.88 1.48 1.48 1.00 1.00 1.48 1.48 1.96 1.96
b11 3.55 3.55 1.38 1.38 1.00 1.00 1.47 1.38 1.64 1.64
b12 3.13 3.13 1.56 1.56 1.09 1.09 1.54 1.37 1.65 1.65
b13 4.27 4.27 1.27 1.27 1.00 1.00 1.54 1.54 1.82 1.82
b14 4.08 4.08 1.37 1.54 1.04 1.04 1.61 1.56 1.62 1.62
b17 1.51 3.95 1.50 1.51 1.17 1.17 1.62 1.61 1.63 1.63
b18 1.49 2.89 1.48 1.48 1.25 1.25 1.61 1.58 1.63 1.63
b19 1.40 1.95 1.39 1.39 1.13 1.13 1.56 1.38 1.62 1.62
b20 1.58 3.78 1.58 1.58 1.17 1.17 1.59 1.52 1.63 1.63
b21 1.58 3.59 1.51 1.51 1.11 1.11 1.60 1.56 1.63 1.63
b22 1.57 3.47 1.59 1.59 1.16 1.16 1.59 1.45 1.63 1.63
Table A.98: Results of DKFMFAST effects on balance cost: β(H^Π), on target T6 for circuits in the ITC set.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
b01 4.92 4.92 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
b02 4.33 4.33 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
b03 2.23 2.23 1.00 1.00 1.00 1.00 1.62 1.62 1.62 1.62
b04 2.86 2.86 1.27 1.27 1.00 1.00 1.50 1.40 1.67 1.67
b05 3.91 3.91 1.00 1.00 1.00 1.00 1.51 1.39 1.68 1.68
b06 4.45 4.45 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
b07 4.60 4.60 1.23 1.23 1.00 1.00 1.45 1.45 1.68 1.68
Table A.99: Results of DKFMFAST effects on balance cost: β(H^Π), on target T1 for circuits in the Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 5.07 5.07 2.14 2.14 1.25 1.25 2.23 2.11 2.09 1.89
mnist 3.15 3.14 1.00 1.00 1.49 1.49 1.77 1.70 2.26 2.26
mobilnet1 3.66 3.66 1.24 1.24 1.25 1.25 2.19 2.00 2.25 2.25
OneCore 5.98 5.98 2.11 2.11 1.00 1.00 2.19 2.18 1.93 2.25
PuLSAR 4.91 4.91 1.84 1.84 1.30 1.30 2.13 1.80 2.25 2.25
WasgaServer 5.62 5.62 2.18 2.18 1.50 1.50 2.06 1.65 2.24 2.24
bitonic_mesh 3.38 3.38 1.00 1.00 1.25 1.25 1.74 1.35 2.14 2.14
cholesky_bdti 4.01 4.01 1.22 1.22 1.00 1.00 1.91 1.54 2.11 2.11
dart 1.81 1.81 1.00 1.00 1.25 1.25 2.21 1.87 2.25 2.25
denoise 2.95 2.95 1.73 1.73 1.25 1.25 2.17 2.05 2.24 2.23
des90 3.50 3.50 1.00 1.00 1.00 1.00 1.89 1.29 2.25 2.25
xge_mac 5.07 5.07 1.52 1.52 1.43 1.43 2.25 2.25 2.25 2.25
cholesky_mc 4.58 4.58 1.19 1.19 1.25 1.25 2.12 1.96 1.90 1.90
Table A.100: Results of DKFMFAST effects on balance cost: β(H^Π), on target T2 for circuits in the Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 5.07 5.07 2.14 2.14 1.25 1.25 2.23 2.11 2.25 2.25
mnist 3.15 3.14 1.00 1.00 1.49 1.49 1.77 1.70 2.26 2.26
mobilnet1 3.66 3.66 1.24 1.24 1.25 1.25 2.19 2.00 2.25 2.25
Table A.101: Results of DKFMFAST effects on balance cost: β(H^Π), on target T3 for circuits in the Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 5.07 5.07 2.14 2.14 1.25 1.25 2.23 2.11 1.45 1.43
mnist 3.15 3.15 1.00 1.00 1.49 1.49 1.77 1.72 2.26 2.26
mobilnet1 3.66 3.66 1.24 1.24 1.25 1.25 2.19 2.00 2.25 2.25
OneCore 5.98 5.98 2.11 2.11 1.00 1.00 2.19 2.18 2.25 2.25
PuLSAR 4.91 4.91 1.84 1.84 1.30 1.30 2.13 1.80 2.25 2.25
WasgaServer 5.62 5.62 2.18 2.18 1.50 1.50 2.06 1.65 2.24 2.24
bitonic_mesh 3.38 3.38 1.00 1.00 1.25 1.25 1.74 1.35 2.25 2.25
cholesky_bdti 4.01 4.01 1.22 1.22 1.00 1.00 1.91 1.54 1.55 1.55
dart 1.81 1.81 1.00 1.00 1.25 1.25 2.21 1.87 2.25 2.25
denoise 2.95 2.95 1.73 1.73 1.25 1.25 2.17 2.05 1.55 1.55
des90 3.50 3.50 1.00 1.00 1.00 1.00 1.89 1.29 2.25 2.25
xge_mac 5.07 5.07 1.52 1.52 1.43 1.43 2.25 2.25 2.25 2.25
cholesky_mc 4.58 4.58 1.19 1.19 1.25 1.25 2.12 1.96 1.87 1.87
Table A.102: Results of DKFMFAST effects on balance cost: β(H^Π), on target T4 for circuits in the Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 4.48 4.48 1.57 1.57 1.08 1.08 1.61 1.56 1.63 1.63
mnist 1.96 1.96 1.30 1.30 1.24 1.24 1.61 1.54 1.63 1.63
mobilnet1 3.69 3.69 1.29 1.29 1.16 1.16 1.59 1.48 1.62 1.62
Table A.103: Results of DKFMFAST effects on balance cost: β(H^Π), on target T5 for circuits in the Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 4.48 4.48 1.57 1.57 1.08 1.08 1.61 1.56 1.63 1.63
mnist 1.96 1.96 1.30 1.30 1.24 1.24 1.61 1.54 1.63 1.63
mobilnet1 3.69 3.69 1.29 1.29 1.16 1.16 1.59 1.48 1.62 1.62
OneCore 4.87 4.87 1.56 1.56 1.16 1.16 1.61 1.55 1.62 1.62
PuLSAR 5.04 5.04 1.45 1.45 1.17 1.17 1.62 1.58 1.63 1.63
WasgaServer 4.14 4.14 1.59 1.59 1.17 1.17 1.57 1.50 1.63 1.63
bitonic_mesh 2.18 2.18 1.00 1.00 1.13 1.13 1.49 1.25 1.62 1.62
cholesky_bdti 2.62 2.62 1.31 1.31 1.08 1.08 1.55 1.40 1.63 1.63
dart 2.63 2.63 1.00 1.00 1.08 1.08 1.62 1.60 1.63 1.63
denoise 2.70 2.70 1.50 1.50 1.17 1.17 1.61 1.58 1.63 1.63
des90 3.07 3.07 1.19 1.19 1.08 1.08 1.52 1.28 1.63 1.63
xge_mac 3.16 3.11 1.30 1.30 1.16 1.16 1.61 1.59 1.61 1.61
cholesky_mc 4.02 4.02 1.26 1.26 1.08 1.08 1.49 1.36 1.63 1.63
Table A.104: Results of DKFMFAST effects on balance cost: β(H^Π), on target T6 for circuits in the Chipyard and Titan sets.
Instance hMetis +DKFM khMetis +DKFM PaToH +DKFM KaHyPar +DKFM TopoPart +DKFM
EightCore 4.48 4.48 1.57 1.57 1.08 1.08 1.61 1.56 1.61 1.61
mnist 1.96 1.96 1.30 1.30 1.24 1.24 1.61 1.54 1.63 1.63
mobilnet1 3.69 3.69 1.29 1.29 1.16 1.16 1.59 1.48 1.62 1.62
References
317
318 J. Rodriguez
References
[2] Cristinel Ababei and Kia Bazargan. Statistical timing driven partitioning
for VLSI circuits. In Proceedings 2002 of the Design, Automation and Test
in Europe Conference and Exhibition, page 1109. IEEE, 2002. 2.4.1
[3] Cristinel Ababei and Kia Bazargan. Timing minimization by statistical tim-
ing hMetis-based partitioning. Proceedings 2003 of the 16th International
Conference on VLSI Design., 2003. 2.4.1
[6] Yaroslav Akhremtsev, Tobias Heuer, Peter Sanders, and Sebastian Schlag.
Engineering a direct k -way hypergraph partitioning algorithm. In Pro-
ceedings of the 19th Workshop on Algorithm Engineering and Experiments,
(ALENEX 2017), pages 28–42, 2017. 2.3.3, 6.1.3
[7] Charles J. Alpert, Jen-Hsin Huang, and Andrew B. Kahng. Multilevel circuit
partitioning. IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, 17(8):655–667, August 1998. Conference Name: IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems.
1.1.2
319
References
[8] Charles J. Alpert and Andrew B. Kahng. Recent directions in netlist par-
titioning: a survey. Integration, 19(1):1–81, August 1995. 1.3.3, 1, 1.3.3,
1.3.3
[9] Alon Amid, David Biancolin, Abraham Gonzalez, Daniel Grubb, Sagar
Karandikar, Harrison Liew, Albert Magyar, Howard Mao, Albert Ou,
Nathan Pemberton, Paul Rigge, Colin Schmidt, John Wright, Jerry Zhao,
Yakun Sophia Shao, Krste Asanović, and Borivoje Nikolić. Chipyard: Inte-
grated design, simulation, and implementation framework for custom socs.
IEEE Micro, 40(4):10–21, 2020. 3.2
[10] Robin Andre, Sebastian Schlag, and Christian Schulz. Memetic multilevel
hypergraph partitioning. In Proceedings of the Genetic and Evolutionary
Computation Conference, GECCO ’18, pages 347–354, 2018. 2.3.3
[12] Shawki Areibi. An integrated genetic algorithm with dynamic hill climbing
for VLSI circuit partitioning. In Proceedings of the Genetic and Evolutionary
Computation Conference (GECCO) 2000, pages 97–102, 2000. 2.4.2
[13] Shawki Areibi and Anthony Vannelli. Circuit partitioning using a Tabu
search approach. In Proceedings 1993 IEEE International Symposium on
Circuits and Systems, pages 1643–1646 vol.3, May 1993. 2.4.2
[14] Shawki Areibi and Anthony Vannelli. Tabu search: A meta heuristic for
netlist partitioning. Proceedings of Very Large Scale Integration Design (VL-
SID), 11(3):259–283, 2000. 2.4.2
[15] Shawki Areibi and Zhen Yang. Effective Memetic Algorithms for VLSI De-
sign = Genetic Algorithms + Local Search + Multi-Level Clustering. Evo-
lutionary Computation, 12(3):327–353, September 2004. 2.4.2
[16] Cevdet Aykanat, Berkant Barla Cambazoglu, and Bora Uçar. Multi-level di-
rect k-way hypergraph partitioning with multiple constraints and fixed ver-
tices. Journal of Parallel and Distributed Computing, 68(5):609–625, 2008.
2.3.2, 6.1.3
[17] Raul Banos, Consolación Gil, Maria Dolores Gil Montoya, and Julio Or-
tega Lopera. A parallel evolutionary algorithm for circuit partitioning. In
Proceedings of the 11th Euromicro Conference on Parallel, Distributed and
Network-Based Processing, pages 365–371, 2003. 2.4.2
320 J. Rodriguez
References
[18] Ranieri Baraglia, Raffaele Perego, José Ignacio Hidalgo, Juan Lanchares,
and Francisco Tirado. A parallel compact genetic algorithm for multi-FPGA
partitioning. In Proceedings of the 9th Euromicro Workshop on Parallel and
Distributed Processing, PDP 2001, 7-9 February 2001, Mantova, Italy, pages
113–120. IEEE Computer Society, 2001. 2.4.2
[20] Dirk Behrens, Klaus Harbich, and Erich Barke. Hierarchical partitioning. In
Proceedings of International Conference on Computer Aided Design, pages
470–477. IEEE, 1996. 2.4.1
[21] Claude Berge. Graphs and hypergraphs. Elsevier Science Ltd., 1985. 1.1,
1.1.1, 1.1.2, 1.1.2
[23] Vaughn Betz and Jonathan Rose. VPR: A new packing, placement and
routing tool for FPGA research. In International Workshop on Field Pro-
grammable Logic and Applications, pages 213–222. Springer, 1997. 2.4.2,
3.2.2
[25] Pierre Bonami, Viet Hung Nguyen, Michel Klein, and Michel Minoux. On
the solution of a graph partitioning problem under capacity constraints. In
Proceedings of the Combinatorial Optimization: Second International Sym-
posium, ISCO 2012, Athens, Greece, April 19-21, 2012, Revised Selected
Papers 2, pages 285–296. Springer, 2012. 5.3, 5.3.2
[26] Daniel R. Brasen, Jean-Pierre Hiol, and Gabriele Saucier. Finding best
cones from random clusters for FPGA package partitioning. In Proceedings
of ASP-DAC’95/CHDL’95/VLSI’95 with EDA Technofair, pages 799–804.
IEEE, 1995. 2.4.1, 5.2.3
[27] Daniel R. Brasen and Gabriele Saucier. Using cone structures for circuit par-
titioning into FPGA packages. Journal of IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems, 17(7):592–600, 1998. 2.4.1,
5.2.3
[29] Thang Nguyen Bui and Curt Jones. A heuristic for reducing fill-in in sparse
matrix factorization. Technical report, Society for Industrial and Applied
Mathematics (SIAM), Philadelphia, PA, 1993. 2.4.1
[30] Raghu Burra and Dinesh Bhatia. Timing driven multi-FPGA board parti-
tioning. In Proceedings of the 11th International Conference on VLSI Design,
pages 234–237. IEEE, 1998. 5.2.2
[31] Ismail Bustany, Grigor Gasparyan, Andrew B. Kahng, Ioannis Koutis, Bod-
hisatta Pramanik, and Zhiang Wang. An open-source constraints-driven
general partitioning multi-tool for vlsi physical design. In In proceeding 2023
IEEE/ACM International Conference on Computer Aided Design (ICCAD),
pages 1–9, 2023. 2.3.4, 2.4.1, 2.4.1
[34] Ümit V. Çatalyürek and Cevdet Aykanat. PATOH: partitioning tool for hy-
pergraphs. In Encyclopedia of parallel computing, pages 1479–1487. Springer,
2011. 2.3.2, 2.4.1, 6.1.3
[35] Ümit V. Çatalyürek, Karen D. Devine, Marcelo Fonseca Faraj, Lars Gottes-
büren, Tobias Heuer, Henning Meyerhenke, Peter Sanders, Sebastian Schlag,
Christian Schulz, Daniel Seemaier, and Dorothea Wagner. More Recent Ad-
vances in (Hyper)Graph Partitioning. Technical Report arXiv:2205.13202,
arXiv, June 2022. arXiv:2205.13202 [cs]. 1.2.1, 2.4.1, 4.1.1, 4.2, 4.2.2, 5.1,
6.1.3
322 J. Rodriguez
References
[39] Yongseok Cheon and Martin Derek F. Wong. Design hierarchy guided multi-
level circuit partitioning. In Proceedings of the 2002 International Symposium
on Physical Design, pages 30–35, 2002. 2.4.1
[40] Cédric Chevalier and François Pellegrini. PT-Scotch: A tool for efficient
parallel graph ordering. Parallel Computing, 34(6-8):318–331, July 2008.
2.3.5
[42] Alberto Colorni, Marco Dorigo, and Vittorio Maniezzo. Distributed opti-
mization by ant colonies. 01 1991. 2.4.2
[43] Jason Cong, Honching Li, and Chang Wu. Simultaneous circuit partition-
ing/clustering with retiming for performance optimization. In Proceedings of
the 36th Annual ACM/IEEE Design Automation Conference, DAC ’99, page
460–465, New York, NY, USA, 1999. Association for Computing Machinery.
2.4.1
[44] Jason Cong, Zheng Li, and Rajive Bagrodia. Acyclic multi-way partitioning
of boolean networks. In Proceedings of the 31st annual design automation
conference, pages 670–675, 1994. 2.4.1, 5.1
[45] Jason Cong and Sung Kyu Lim. Performance driven multiway partition-
ing. In Proceedings of the 37th Design Automation Conference. (IEEE Cat.
No.00CH37106), pages 441–446, January 2000. 2.4.1
[46] Jason Cong and M’Lissa Smith. A parallel bottom-up clustering algorithm
with applications to circuit partitioning in VLSI design. In Proceedings of
the 30th International Design Automation Conference, pages 755–760, 1993.
2.4.1
[50] Fulvio Corno, Matteo Sonza Reorda, and Giovanni Squillero. RT-level
ITC’99 benchmarks and first ATPG results. Journal of IEEE Design &
Test of computers, 17(3):44–53, 2000. 3.2, 3.2.1, 5.5.1
[51] Panayiotis Danassis, Kostas Siozios, and Dimitrios Soudris. ANT3D: Si-
multaneous partitioning and placement for 3-D FPGAs based on ant colony
optimization. Journal of IEEE Embedded Systems Letters, 8(2):41–44, 2016.
2.4.2
[55] Mehmet Deveci, Kamer Kaya, Bora Uçar, and Ümit V. Çatalyürek. Hy-
pergraph partitioning for multiple communication cost metrics: Model and
methods. Journal of Parallel and Distributed Computing, 77:69–83, 2015.
1.3.1
[56] Karen D. Devine, Erik G. Boman, Robert T. Heaphy, Rob H. Bisseling, and
Ümit V. Çatalyürek. Parallel hypergraph partitioning for scientific comput-
ing. In Proceedings 20th IEEE International Parallel & Distributed Process-
ing Symposium, pages 10–pp. IEEE, 2006. 6.1.3
324 J. Rodriguez
References
[57] Ajit A. Diwan, Sanjeeva Rane, Sridhar Seshadri, and Sundararajarao Su-
darshan. Clustering techniques for minimizing external path length. In
Proceedings of the 22th International Conference on Very Large Data Bases,
pages 342–353, 1996. 2.2.2, 4.1.3, 4.3.2
[59] Zola Nailah Donovan. Algorithmic Issues in some Disjoint Clustering Prob-
lems in Combinatorial Circuits. Thesis dissertation, West Virginia University
Libraries, January 2018. 2.4.1, 4.1.3, 4.2.1, 4.3.3, 4.3.3, 4.4.1, 4.4.2
[61] Zola Nailah Donovan, Kirubakran Subramani, and Vahan Mkrtchyan. Dis-
joint Clustering in Combinatorial Circuits. In Charles J. Colbourn, Roberto
Grossi, and Nadia Pisanti, editors, Combinatorial Algorithms, Lecture Notes
in Computer Science, pages 201–213, Cham, 2019. Springer International
Publishing. 2.2.2, 2.4.1, 4.1.3, 4.3.1, 4.4.2
[62] Zola Nailah Donovan, Kirubakran Subramani, and Vahan Mkrtchyan. Ana-
lyzing Clustering and Partitioning Problems in Selected VLSI Models. The-
ory of Computing Systems, 64(7):1242–1272, October 2020. 2.4.1, 4.1.3, 4.4
[64] Marco Dorigo and Luca Maria Gambardella. Ant colonies for the travelling
salesman problem. Journal of Biosystems, 43(2):73–81, 1997. 2.4.2
[67] Shantanu Dutt and Wenyong Deng. VLSI circuit partitioning by cluster-
removal using iterative improvement techniques. In Proceedings of Interna-
tional Conference on Computer Aided Design, pages 194–200. IEEE, 1996.
2.3.1, 2.4.1
[69] John M. Emmert, Sandeep Lodha, and Dinesh K. Bhatia. On using tabu
search for design automation of VLSI systems. Journal of Heuristics, 9:75–
90, 2003. 2.4.2
[71] Shimon Even. Graph algorithms. Cambridge University Press, 2011. 5.2.2
[72] Wen-Jong Fang and Allen C.-H. Wu. Multiway FPGA partitioning by fully
exploiting design hierarchy. ACM Transactions on Design Automation of
Electronic Systems (TODAES), 5(1):34–50, 2000. 2.4.1
[73] Umer Farooq and Bander A. Alzahrani. Exploring and optimizing partition-
ing of large designs for multi-FPGA based prototyping platforms. Comput-
ing, 102(11):2361–2383, 2020. 2.4.1, 2.4.1
[74] Umer Farooq, Imran Baig, Muhammad Khurram Bhatti, Habib Mehrez,
Arun Kumar, and Manoj Gupta. Prototyping using multi-FPGA platform:
A novel and complete flow. Microprocessors and Microsystems, 96:104751,
2023. 2.4.1
[77] Lester Randolph Ford and Delbert R. Fulkerson. Maximal flow through a
network. Canadian journal of Mathematics, 8:399–404, 1956. 2.2.1
326 J. Rodriguez
References
[79] Sandeep Singh Gill, Bhupesh Aneja, Rajeevan Chandel, and Ashwani Ku-
mar Chandel. Simulated annealing based VLSI circuit partitioning for delay
minimization. In Proceedings of the 4th WSEAS international conference on
Computational intelligence, pages 60–63, 2010. 2.4.2
[80] Sandeep Singh Gill, Rajeevan Chandel, and Ashwani Chandel. Genetic al-
gorithm based approach to circuit partitioning. International Journal of
Computer and Electrical Engineering, 2(2):196, 2010. 2.4.2
[81] Fred Glover and Manuel Laguna. Tabu search. Springer, 1998. 2.4.2
[82] Mark Goldberg and Zevi Miller. A parallel algorithm for bisection width
in trees. Computers & Mathematics with Applications, 15(4):259–266, 1988.
4.3.2
[83] Olivier Goldschmidt and Dorit S. Hochbaum. A Polynomial Algorithm
for the k-cut Problem for Fixed k. Mathematics of Operations Research,
19(1):24–37, February 1994. Publisher: INFORMS. 2.2.1
[84] Ralph E. Gomory. Solving linear programming problems in integers. Com-
binatorial Analysis, 10:211–215, 1960. 2.4.1
[85] Lars Gottesbüren, Michael Hamann, Sebastian Schlag, and Dorothea Wag-
ner. Advanced flow-based multilevel hypergraph partitioning. In Proceedings
of the 18th International Symposium on Experimental Algorithms (SEA),
Leibniz International Proceedings in Informatics (LIPIcs), pages 11:1–11:15,
2020. 2.3.3
[86] Anael Grandjean, Johannes Langguth, and Bora Uçar. On optimal and
balanced sparse matrix partitioning problems. In Proceedings of 2012 IEEE
International Conference on Cluster Computing, pages 257–265. IEEE, 2012.
5.2.1
[87] Scott Hauck and Gaetano Borriello. An evaluation of bipartitioning tech-
niques. IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems, 16(8):849–866, 1997. 2.4.1
[88] Bruce Hendrickson and Robert Leland. A multilevel algorithm for parti-
tioning graphs. In Proceedings of the 1995 ACM/IEEE Conference on Su-
percomputing, Supercomputing ’95, page 28–es, New York, NY, USA, 1995.
Association for Computing Machinery. 2.4.1, 4.1.1
[89] Alexandra Henzinger, Alexander Noe, and Christian Schulz. ILP-based local
search for graph partitioning. ACM, Journal of Experimental Algorithmics
(JEA), 25:1–26, jul 2020. 2.4.1, 5.3
[90] Monika Henzinger, Satish Rao, and Di Wang. Local flow partitioning for
faster edge connectivity. SIAM Journal on Computing, 49(1):1–36, 2020.
2.2.1
[91] Julien Herrmann, Jonathan Kho, Bora Uçar, Kamer Kaya, and Ümit V.
Çatalyürek. Acyclic partitioning of large directed acyclic graphs. In 2017
17th IEEE/ACM international symposium on cluster, cloud and grid com-
puting (CCGRID), pages 371–380. IEEE, 2017. 2.4.1, 5.1
[92] Julien Herrmann, M. Yusuf Ozkaya, Bora Uçar, Kamer Kaya, and Ümit V.
Çatalyürek. Multilevel algorithms for acyclic partitioning of directed acyclic
graphs. SIAM Journal on Scientific Computing, 41(4):A2117–A2145, 2019.
2.4.1, 5.1
[93] Tobias Heuer, Peter Sanders, and Sebastian Schlag. Network flow-based re-
finement for multilevel hypergraph partitioning. In 17th International Sym-
posium on Experimental Algorithms (SEA 2018), pages 1:1–1:19, 2018. 2.3.3
[94] Tobias Heuer, Peter Sanders, and Sebastian Schlag. Network flow-based
refinement for multilevel hypergraph partitioning. Journal of Experimental
Algorithmics (JEA), 24:1–36, 2019. 2.3.3, 6
[95] Tobias Heuer and Sebastian Schlag. Improving coarsening schemes for hy-
pergraph partitioning by exploiting community structure. In Proceedings of
the 16th International Symposium on Experimental Algorithms, (SEA 2017),
pages 21:1–21:19, 2017. 2.3.3, 4.1.1
[96] José Ignacio Hidalgo, Juan Lanchares, and Román Hermida. Partitioning
and placement for multi-FPGA systems using genetic algorithms. In Proceed-
ings of the 26th Euromicro Conference. EUROMICRO 2000. Informatics:
Inventing the Future, volume 1, pages 204–211 vol.1, 2000. 2.4.2
[97] Torsten Hoefler and Marc Snir. Generic topology mapping strategies for
large-scale parallel architectures. In Proceedings of the International Con-
ference on Supercomputing, pages 75–84, 2011. 2.1.3
[98] Karla Hoffman and Manfred Padberg. Solving airline crew scheduling prob-
lems by branch-and-cut. Management Science, 39:657–682, 06 1993. 2.4.1
328 J. Rodriguez
References
[101] Dennis J.-H. Huang and Andrew B. Kahng. Multi-way system partition-
ing into a single type or multiple types of FPGAs. In Proceedings of the
1995 ACM third international symposium on Field-programmable gate ar-
rays, pages 140–145, 1995. 5.2.2
[102] Edmund Ihler, Dorothea Wagner, and Frank Wagner. Modeling hypergraphs
by graphs with the same mincut properties. Information Processing Letters,
45(4):171–175, 1993. 1.3.3, 2.1.2
[104] Andrew B. Kahng and Xu Xu. Local unidirectional bias for smooth cutsize-
delay tradeoff in performance-driven bipartitioning. In Proceedings of the
2003 international symposium on Physical design, ISPD ’03, pages 81–86,
New York, NY, USA, April 2003. Association for Computing Machinery.
2.4.1
[106] George Karypis. Metis: Unstructured graph partitioning and sparse matrix
ordering system. Technical report, 1997. 2.3.1, 2.4.1
[107] George Karypis, Rajat Aggarwal, Vipin Kumar, and Shashi Shekhar. Mul-
tilevel hypergraph partitioning: applications in VLSI domain. IEEE Trans-
actions on Very Large Scale Integration (VLSI) Systems, 7(1):69–79, March
1999. Conference Name: IEEE Transactions on Very Large Scale Integration
(VLSI) Systems. 2.4.1, 2.4.2, 6.1.3
[108] George Karypis and Vipin Kumar. Analysis of multilevel graph partitioning.
In Proceedings of the 1995 ACM/IEEE conference on Supercomputing, pages
29–es, 1995. 4.4.2
[109] George Karypis and Vipin Kumar. A fast and high quality multilevel scheme
for partitioning irregular graphs. SIAM Journal on scientific Computing,
20(1):359–392, 1998. 2.3.1, 2.4.1, 4.1.1
[110] George Karypis and Vipin Kumar. Hmetis: a hypergraph partitioning pack-
age. ACM Transactions on Architecture and Code Optimization, 1998. 4.4.2,
6.1.3
[111] George Karypis and Vipin Kumar. A hypergraph partitioning package. Army
HPC Research Center, Department of Computer Science & Engineering,
University of Minnesota, 1998. 2.3.1
[112] George Karypis and Vipin Kumar. Multilevel k-way hypergraph partitioning.
In Proceedings of the 36th annual ACM/IEEE design automation conference,
pages 343–348, 1999. 2.3.1, 2.4.1, 6.1.3
[113] Brian W. Kernighan. Optimal sequential partitions of graphs. Journal of
the ACM (JACM), 18(1):34–40, 1971. 5.2.1
[114] Brian W. Kernighan and Shen Lin. An efficient heuristic procedure for
partitioning graphs. The Bell system technical journal, 49(2):291–307, 1970.
2.4.1, 6, 6.1.1, 6.1.1
[115] Scott Kirkpatrick, C. Daniel Gelatt Jr, and Mario P. Vecchi. Optimization
by simulated annealing. science, 220(4598):671–680, 1983. 2.4.2
[116] Shad Kirmani, Jeonghyung Park, and Padma Raghavan. An embedded
sectioning scheme for multiprocessor topology-aware mapping of irregular
applications. The International Journal of High Performance Computing
Applications, 31(1):91–103, 2017. 2.1.3
[117] Venkataramana Kommu and Irith Pomeranz. GAFPGA: genetic algorithm
for FPGA technology mapping. In Proceedings of the European Design Au-
tomation Conference 1993, EURO-DAC ’93 with EURO-VHDL’93, Ham-
burg, Germany, September 20-24, 1993, pages 300–305. IEEE Computer
Society, 1993. 2.4.2
[118] Helena Krupnova, Ali Abbara, and Gabrièle Saucier. A Hierarchy-Driven
FPGA Partitioning Method. page 4. 2.4.1
[119] Dorothy Kucar, Shawki Areibi, and Anthony Vannelli. Hypergraph parti-
tioning techniques. Dynamics of Continuous Discrete and Impulsive Systems
Series A, 11:339–368, 2004. 5.3
[120] Ailsa H. Land and Alison G. Doig. An automatic method of solving discrete
programming problems. Econometrica, 28(3):497–520, 1960. 2.4.1
[121] Eugene L. Lawler. Cutsets and partitions of hypergraphs. Networks,
3(3):275–285, 1973. 2.2.1
[122] Eugene L. Lawler, Karl N. Levitt, and James Turner. Module clustering
to minimize delay in digital networks. IEEE Transactions on Computers,
100(1):47–57, 1969. 2.4.1, 4.1.2
330 J. Rodriguez
References
[123] Chin Yang Lee. An algorithm for path connections and its applications. IRE
transactions on electronic computers, (3):346–365, 1961. 5.2.1
[124] Ming Leng and Songnian Yu. An effective multi-level algorithm based on
ANT colony optimization for bisecting graph. In Proceedings ot Advances
in Knowledge Discovery and Data Mining: 11th Pacific-Asia Conference,
PAKDD 2007, Nanjing, China, May 22-25, 2007. Proceedings 11, pages 138–
149. Springer, 2007. 2.4.2
[126] Jianmin Li, John Lillis, and Chung-Kuan Cheng. Linear decomposition algo-
rithm for VLSI design applications. In Richard L. Rudell, editor, Proceedings
of the 1995 IEEE/ACM International Conference on Computer-Aided De-
sign, ICCAD 1995, San Jose, California, USA, November 5-9, 1995, pages
223–228. IEEE Computer Society / ACM, 1995. 2.2.1
[127] Sin-Hong Liou, Sean Liu, Richard Sun, and Hung-Ming Chen. Timing Driven
Partition for Multi-FPGA Systems with TDM Awareness. In Proceedings
of the 2020 International Symposium on Physical Design, ISPD ’20, pages
111–118, New York, NY, USA, March 2020. Association for Computing Ma-
chinery. (document), 2.4.1, 3.1, 3.2.2, 5.1, 5.4
[128] Sandeep Lodha and Dinesh Bhatia. Bipartitioning circuits using tabu search.
In Proceedings of the 11th Annual IEEE International ASIC Conference
(Cat. No. 98TH8372), pages 223–227. IEEE, 1998. 2.4.2
[130] Pongstorn Maidee, Cristinel Ababei, and Kia Bazargan. Fast timing-driven
partitioning-based placement for island style FPGAs. In Proceedings of the
40th Annual Design Automation Conference, pages 598–603, 2003. 2.4.2
[132] Theodore Manikas and James T. Cain. Genetic Algorithms vs. Simulated
Annealing: A Comparison of Approaches for Solving the Circuit Partitioning
Problem. Technical report, 1996. 2.4.2
[136] Orlando Moreira, Merten Popp, and Christian Schulz. Evolutionary multi-
level acyclic graph partitioning. In Proceedings of the Genetic and Evolu-
tionary Computation Conference, pages 332–339, 2018. 5.1
[137] Kevin E. Murray, Scott Whitty, Suya Liu, Jason Luu, and Vaughn Betz.
TITAN: Enabling large and complex benchmarks in academic CAD. In Pro-
ceedings of the 23rd International Conference on Field Programmable Logic
and Applications, pages 1–8, Porto, Portugal, September 2013. IEEE. 3.2,
3.2.2
[139] Dang Phuong Nguyen, Michel Minoux, Viet Hung Nguyen, Thanh Hai
Nguyen, and Renaud Sirdey. Improved compact formulations for a wide
class of graph partitioning problems in sparse graphs. Discrete Optimiza-
tion, 25:175–188, 2017. 5.3
[140] Jenny Nossack and Erwin Pesch. A branch-and-bound algorithm for the
acyclic partitioning problem. Computers & Operations Research, 41:174–184,
2014. 2.4.1, 5.1
[141] Vitaly Osipov and Peter Sanders. n-level graph partitioning. In Proceedings
of the 18th Annual European Symposium on Algorithms (ESA), Liverpool,
UK, September 6–8, 2010, Part I, pages 278–289. Springer, 2010. 2.3.3,
2.4.1
[145] Peichen Pan, Arvind K. Karandikar, and Chung Laung Liu. Optimal clock
period clustering for sequential circuits with retiming. IEEE Transactions
on Computer-Aided Design of Integrated Circuits and Systems, 17(6):489–498,
1998. 2.4.1, 4.1.3
[146] David Papa and Igor Markov. Hypergraph Partitioning and Clustering.
Handbook of Approximation Algorithms and Metaheuristics, May 2007. 6.1.4
[148] François Pellegrini. Graph partitioning based methods and tools for scientific
computing. Parallel Computing, 23(1-2):153–164, 1997. 2.1.3
[150] François Pellegrini and Cédric Lachat. Process Mapping onto Complex Ar-
chitectures and Partitions Thereof. Research Report RR-9135, Inria Bor-
deaux Sud-Ouest, December 2017. 2.1.3
[151] François Pellegrini and Jean Roman. Scotch: A software package for static
mapping by dual recursive bipartitioning of process and architecture graphs.
In Proceedings of High-Performance Computing and Networking (HPCN
Europe 1996), Brussels, Belgium, April 15–19, 1996, pages 493–498.
Springer, 1996. 1.3, 2.3.5
[153] Merten Popp, Sebastian Schlag, Christian Schulz, and Daniel Seemaier. Mul-
tilevel acyclic hypergraph partitioning. In 2021 Proceedings of the Workshop
on Algorithm Engineering and Experiments (ALENEX), pages 1–15. SIAM,
2021. 2.4.1, 5.1
[154] Nicolas Pouillon and Alain Greiner. System on Chip Library Project. [online]
Available: https://largo.lip6.fr/trac/dsx/wiki, 2010. 2.4.1
[155] Bryan T. Preas, Michael J. Lorenzetti, and Bryan D. Ackland. Physical De-
sign Automation of VLSI Systems. Benjamin/Cummings Publishing Com-
pany, 1988. 1.3.1
[156] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. Near linear
time algorithm to detect community structures in large-scale networks.
Physical Review E, 76(3):036106, 2007. 5.1
[157] Rajmohan Rajaraman and Martin D. F. Wong. Optimal clustering for
delay minimization. In Proceedings of the 30th International Design Automa-
tion Conference, DAC ’93, pages 309–314, New York, NY, USA, July 1993.
Association for Computing Machinery. 2.4.1, 4.1.2
[158] Bernhard M. Riess, Konrad Doll, and Frank M. Johannes. Partitioning very
large circuits using analytical placement techniques. In Proceedings of the
31st Annual Design Automation Conference, pages 646–651, 1994. 2.3.1, 2.4.1
[159] Neil Robertson and Paul Seymour. Graph minors. I. Excluding a forest.
Journal of Combinatorial Theory, Series B, 35(1):39–61, August 1983. 4.3.1,
4.3.2
[160] Julien Rodriguez, François Galea, François Pellegrini, and Lilia Zaourar. A
hypergraph model and associated optimization strategies for path length-
driven netlist partitioning. In Proceedings of the 23rd International Con-
ference on Computational Science (ICCS), pages 652–660. Springer, 2023.
(document), 1.1.3, 1.2.1, 5, 6
[161] Julien Rodriguez, François Galea, François Pellegrini, and Lilia Zaourar.
Path length-driven hypergraph partitioning: An integer programming ap-
proach. In Maria Ganzha, Leszek A. Maciaszek, Marcin Paprzycki, and
Dominik Slezak, editors, Proceedings of the 18th Conference on Computer
Science and Intelligence Systems (FedCSIS), 2023.
[162] Julien Rodriguez, François Galea, François Pellegrini, and Lilia Zaourar.
Hypergraph clustering with path-length awareness. In Proceedings of the
24th International Conference on Computational Science (ICCS). Springer,
to appear. (document), 4
[163] Kalapi Roy-Neogi and Carl Sechen. Multiple FPGA partitioning with per-
formance optimization. In Proceedings of the 1995 ACM Third International
Symposium on Field-Programmable Gate Arrays, pages 146–152, 1995. 2.4.1,
2.4.2
[166] Sadiq M. Sait, Feras Chikh Oughali, and Mohammed Al-Asli. Design parti-
tioning and layer assignment for 3D integrated circuits using tabu search and
simulated annealing. Journal of Applied Research and Technology, 14(1):67–76,
2016. 2.4.2
[168] Peter Sanders and Christian Schulz. Engineering multilevel graph partition-
ing algorithms. In Proceedings of the European Symposium on Algorithms
(ESA), pages 469–480. Springer, 2011. 6
[169] Peter Sanders and Christian Schulz. KaHIP v3.00 – Karlsruhe High Quality
Partitioning – User Guide. arXiv preprint arXiv:1311.1714, 2013. 2.3.3, 2.4.1
[170] Dhiraj Sangwan, Seema Verma, and Rajesh Kumar. An efficient approach
to VLSI circuit partitioning using evolutionary algorithms. In Proceedings
of 2014 International Conference on Computational Intelligence and Com-
munication Networks, pages 925–929, 2014. 2.4.2
[171] Gabrièle Saucier, Daniel R. Brasen, and Jean-Pierre Hiol. Circuit Par-
titioning for FPGAs. In Gabrièle Saucier and Anne Mignotte, editors,
Logic and Architecture Synthesis: State-of-the-art and novel approaches,
IFIP Advances in Information and Communication Technology, pages 97–
106. Springer US, Boston, MA, 1995. 5.2.3, 5.2.3, 5.2.3
[172] Gabrièle Saucier, Daniel R. Brasen, and Jean-Pierre Hiol. Partitioning with
cone structures. In Proceedings of 1993 International Conference on Com-
puter Aided Design (ICCAD), pages 236–239, Santa Clara, CA, USA, 1993.
IEEE Comput. Soc. Press. 2.4.1, 5.2.3
[174] Sebastian Schlag, Vitali Henne, Tobias Heuer, Henning Meyerhenke, Peter
Sanders, and Christian Schulz. k-way hypergraph partitioning via n-level
recursive bisection. In Proceedings of the 18th Workshop on Algorithm Engi-
neering and Experiments, (ALENEX 2016), pages 53–67, 2016. 2.3.3, 6.1.3
[176] Daniel G. Schweikert and Brian W. Kernighan. A proper model for the
partitioning of electrical circuits. In Proceedings of the 9th Design Automation
Workshop, DAC ’72, pages 57–62, 1972. ACM Press. 1.14, 1.3.3
[178] R. Oguz Selvitopi, Ata Turk, and Cevdet Aykanat. Replicated partitioning
for undirected hypergraphs. Journal of Parallel and Distributed Computing,
72(4):547–563, 2012. 2.3.2
[179] Minshine Shih, Ernest S. Kuh, and Ren-Song Tsay. Integer programming
techniques for multiway system partitioning under timing and capacity con-
straints. In Proceedings of 1993 European Conference on Design Automation
with the European Event in ASIC Design, pages 294–298. IEEE, 1993. 5.3
[180] Horst D. Simon and Shang-Hua Teng. How good is recursive bisection?
SIAM Journal on Scientific Computing, 18(5):1436–1445, 1997. 2.1.1
[181] Adam Słowik and Michał Białko. Partitioning of VLSI circuits on subcircuits
with minimal number of connections using evolutionary algorithm. In Pro-
ceedings of the International Conference on Artificial Intelligence and Soft
Computing, pages 470–478. Springer, 2006. 2.4.2
[183] Yu-Hsuan Su, Emplus Huang, Hung-Hao Lai, and Yi-Cheng Zhao.
Computer-Aided Design Contest Problem B: System-level FPGA Routing
with Timing Division Multiplexing Technique. 3.3.2
[185] Cliff Sze and Ting-Chi Wang. Optimal circuit clustering for delay minimiza-
tion under a more general delay model. IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems, 22:646–651, June 2003.
2.4.1, 4.1.2
[186] Robert Endre Tarjan. Efficiency of a good but not linear set union algorithm.
Journal of the ACM (JACM), 22(2):215–225, 1975. 5.2.3, 5.2.3
[187] Katerina Tashkova, Peter Korošec, and Jurij Šilc. A distributed multilevel
ant-colony algorithm for the multi-way graph partitioning. International
Journal of Bio-Inspired Computation, 3(5):286–296, 2011. 2.4.2
[188] Mariem Turki, Habib Mehrez, Zied Marrakchi, and Mohamed Abid. Parti-
tioning constraints and signal routing approach for multi-FPGA prototyping
platform. In 2013 International symposium on system on chip (SoC), pages
1–4. IEEE, 2013. 2.4.1
[190] Honghua Yang and Martin D. F. Wong. Circuit clustering for delay
minimization under area and pin constraints. In Proceedings of the European
Design and Test Conference, ED&TC 1995, pages 65–70, March 1995. 2.4.1,
4.1.2
[191] Zhengxi Yang, Zhipeng Jiang, Wenguo Yang, and Suixiang Gao. Balanced
graph partitioning based on mixed 0-1 linear programming and iteration ver-
tex relocation algorithm. Journal of Combinatorial Optimization, 45(5):121,
2023. 5.3
[192] Jih-Shyr Yih and Pinaki Mazumder. A neural network design for circuit
partitioning. In Proceedings of the 26th ACM/IEEE Design Automation
Conference, pages 406–411, 1989. 1.3.3
[193] Dan Zheng, Xinshi Zang, and Martin D. F. Wong. TopoPart: A Multi-
level Topology-Driven Partitioning Framework for Multi-FPGA Systems. In
Proceedings of the 2021 IEEE/ACM International Conference On Computer
Aided Design (ICCAD), pages 1–8, November 2021. ISSN: 1558-2434. 2.3.4,
2.4.1, 3.2.4
[194] Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled
data with label propagation. Technical report, School of Computer Science,
Carnegie Mellon University, Pittsburgh, PA, 15213, 07 2003. 5.1
Publications
Conferences