Thesis-Parallel Algorithms For Matching
Thesis-Parallel Algorithms For Matching
Department of Informatics
Algorithms
Student: Supervisor:
Håkon Heggernes Lerring Professor Fredrik Manne
Master Thesis
June 2017
Acknowledgement
First of all: I would like to thank Fredrik Manne for supervising me. I could not
have done this without your help. I also want to thank Anna Maria Eilertsen for
good support and relentless positive attitude
Contents
1 Introduction 5
2 Background 6
2.1 Parallelisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 OpenMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Atomic operations . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.3 Load balancing . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Stable Marriage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Popular Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3 Sequential algorithms 11
3.1 Stable Marriage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.2 The gale-shapley algorithm . . . . . . . . . . . . . . . . . 12
3.1.3 The mcvitie-wilson algorithm . . . . . . . . . . . . . . . 13
3.2 The b-Marriage Problem . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3 Maximum Size Popular Matching . . . . . . . . . . . . . . . . . . . 15
3.3.1 The huang-kavitha algorithm . . . . . . . . . . . . . . . . 15
4 Parallel algorithms 18
4.1 Parallel Stable Marriage . . . . . . . . . . . . . . . . . . . . . . . . 18
4.1.1 A parallel gale-shapley algorithm . . . . . . . . . . . . . 19
4.1.2 A parallel mcvitie-wilson algorithm . . . . . . . . . . . . 20
4.2 Parallel b-Marriage . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.3 Parallel Popular Match. . . . . . . . . . . . . . . . . . . . . . . . . 22
3
CONTENTS CONTENTS
5 Exp. results 27
5.1 Test setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.1.1 Xeon Phi . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.2 Instances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.3 Stable marriage results . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.3.1 Solution quality . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.3.2 Easy instances . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.3.3 Hard instances . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.3.4 Real graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.3.5 Xeon Phi performance . . . . . . . . . . . . . . . . . . . . . 37
5.4 b-Marriage results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.5 Popular matching results . . . . . . . . . . . . . . . . . . . . . . . . 42
6 Conclusion 45
Bibliography 46
A Appendix 48
A.1 On Stable Marriages and Greedy Matchings . . . . . . . . . . . . . 48
4
Chapter 1
Introduction
Every year hundreds of thousands of students express their preferences for higher
education in Norway. Each hope-to-be student lists a couple of schools and pro-
grammes they want to be accepted to. The student orders the schools in the order
of most preferred to least preferred, and submits them to The Norwegian Universi-
ties and Colleges Admission Service (NUCAS). NUCAS is then tasked with the job
of deciding who gets to study where. While the task might seem daunting, it is not
without hope. The problem college admissions has been studied since 1962 when
Gale and Shapley introduced a solution to this problem: “College Admissions and
the Stability of Marriage”[2]. With their solution we end up with a set of accepted
applications and an important promise of stability: if a student is accepted to a
less-than-preferred college he will not get accepted to a more preferred school by
talking to them directly. Likewise for the school, if they only get students with
bad grades they will not be able to admit a better student by talking to him/her
directly.
Solving this problem for large amounts of students and colleges take time, and
as we all know: time is money. The question we want to answer is then: can
we solve problems like the college admissions problem faster by utilising multiple
processors of modern CPUs?
Previous attempts at this have been less successful [7], but recent developments
in the closely related field of weighted maximum matching suggest a new way [6][9].
In this thesis we use these ideas to present new parallel algorithms for the stable
marriage problem, the b-Marriage problem and the Popular Matching problem.
Parts of this thesis is published as the paper “On Stable Marriages And Greedy
Matchings”[8] to the SIAM Workshop on Combinatorial Scientific Computing (CSC16)
and is provided in appendix A.1. The paper received the Best Paper Award.
5
Chapter 2
Background
2.1 Parallelisation
To understand the work we have done it is useful to have some basic understand-
ing of how multi threading works on modern operating systems and some of the
tools we used to develop our parallel algorithms. We assume basic knowledge of
programming and a basic intuition of how programs compile to a sequential list of
instructions (a program).
In regular sequential applications there is usually a single execution path through
the program. Running the program, the operating system spawns a process, allo-
cates some time on a processor to the process and executes it. To ensure that other
process can interfer with the execution of the program, the memory allocated is
usually kept private. Modern computers often feature multiple processors, letting
the operating system safely run multiple programs concurrently without two pro-
grams modifying each others states. Running our program on only one processor
at a time limit us to the execution speed of a single processor.
One way to speed up the execution of our program is to run multiple processes,
each with their own memory space and a part of the problem to solve, without
any communication between the processes. As long as there are no dependencies
between the parts we run in parallel, this is a valid way to speed up the execution.
When the execution of the processes depend on each other, they need to communi-
cate. When possible, inter-process communication can be expensive time-wise and
also requires careful planning of how the processes communicate. One option is
to let the programs share memory, which is the basis for multithreaded program-
ming, letting each program (or thread) make decisions based on a shared view
of the memory. However, this can lead to what is know as a race condition. A
6
CHAPTER 2. BACKGROUND 2.1. PARALLELISATION
race condition is a situation where the result of some computation depends on the
timing of the execution. We illustrate this in Figure 2.1 where the result of two
threads executing an incrementation of a variable (a++), depends on how the oper-
ations are interleaved. In our example incrementing is implemented as a three-step
operation: 1. read the memory location to a temporary variable 2. write back the
temporary variable plus one 3. return the temporary variable.
tmp’ = a 0
a = tmp + 1 1
y
a = tmp’ + 1 1
return tmp return tmp’ 1
Figure 2.1: Interleaved incrementation of a
Each thread first fetches the value of a to a temporary variable, increments the
value of the temporary variable, and then store it back to variable a. Even though
we expected the final value of the variable a to be 2, the value each thread writes
back is 1.
Allowing each thread to do both operations in one go without the other thread
doing any simultaneous operations would give a correct result, but this requires
some kind of synchronisation between the threads. Synchronisation is the process
where multiple threads, through some kind of communcation, agree to pause for
some condition to occur. Synchronisation can reduce the gain we get from running
on multiple processors by doing extra work or waiting for other threads.
There are two “layers” in which we are provided tools that help us with paral-
lelising our code. One of which is OpenMP, an API for developing parallel programs
on shared-memory computers, and the other being primitives provided by the CPU
for doing safe concurrent operations.
2.1.1 OpenMP
OpenMP provides an interface for splitting the execution of a program into mul-
tiple threads. This is done by prepending a block of code with the #pragma omp
parallel preprocessing directive which, when precompiled by a compiler sup-
porting OpenMP, makes the block of code run in parallel on as many threads as
supported by the system. OpenMP also provides a set of easy-to-use functions for
7
2.1. PARALLELISATION CHAPTER 2. BACKGROUND
common tasks inside a parallel region, most notable (for our use) of which are the
#pragma omp for, omp_set_lock/omp_unset_lock, and omp_get_wtime.
The #pragma omp for takes a well-formed (no multiple exits or unpredictable
jumps) for-loop and splits it so that each thread gets a part of the iterations. It
can be configured with extra directives such as how to split the data (schedule
directive), which variables should be private to the thread (private directive) or
how to combine the result of the iterations through some kind of reduction function
(the reduction directive).
The omp_set_lock and omp_unset_lock functions are used for synchronisation
between the threads. The functions takes a shared memory location as parameter
which is used as a lock instance. Once a lock is set, it cannot be set again from any
other thread without being unset first. OpenMP also has a more coarse-grained
version of these locks, the #pragma omp critical directive which does not require
the programmer to set up a lock instance, but ensures that only one thread at any
given time will be in the region of code covered by the critical pragma. In our
algorithms we can prevent unnecessary waiting by ensuring that only threads with
overlapping memory reads/writes use the same lock instance. A region of code
protected by a lock is also called a critical section.
Some of the same functionality as OpenMP provices is available through for
example the pthreads library (with quite a bit of extra programming), but we chose
OpenMP for its ease of use and preexisting knowledge of the API.
8
CHAPTER 2. BACKGROUND 2.2. MATCHING
Operation Description
tmp = *a;
atomic_fetch_and_add(*a, b) a = tmp + b;
return tmp;
if(*a != b) {
return false;
atomic_compare_and_swap(*a, b, c) }
a = c;
return true;
Figure 2.2: Relevant atomic operations
work left more even. Uneven load balancing is a situation where some thread
of the program continutes to execute tasks with other threads idling. This can
be because the problem instance is divided between the threads in a way that
leaves some threads with more work than others. This is waste of precious CPU-
time as some processors may go idle while there is still work to do. Knowing the
partitioning of the problem instance beforehand can be be computationally hard
and require too much time to process, and as a result load balancing is often done
during runtime.
2.2 Matching
9
2.3. STABLE MARRIAGE CHAPTER 2. BACKGROUND
10
Chapter 3
Sequential algorithms
3.1.1 Introduction
The problem of college admissions was first introduced by Gale and Shapley with
their algorithm for solving the problem of admitting students to colleges. In [2],
they describe the familiar situation where the students on one side wants to be
accepted by the best (based on some personal ranking) college, while the colleges
on the other side wants to admit their best (by grades or other measures) ranked
students.
Gale and Shapley gave their solution (which we call the gale-shapley algo-
rithm) to another problem: the stable marriage problem. In the stable marriage
problem each side has a quota of 1 to fill. Instead of students and colleges there
are men an women ready to marry. McVitie and Wilson later expanded on the
gale-shapley algorithm [10], by giving a recursive version we call the mcvitie-
wilson algorithm.
11
3.1. STABLE MARRIAGE CHAPTER 3. SEQUENTIAL ALGORITHMS
12
CHAPTER 3. SEQUENTIAL ALGORITHMS
3.2. THE B-MARRIAGE PROBLEM
Algorithm 1 terminates when there are no more elements on the queue, at which
point the set of edges {(suitor(r), r) | r ∈ R} is our matching. In SM instances,
where each man ranks each woman and vice versa, it is shown that Algorithm 1
always produce a complete stable solution.[2] This may not be true for the case
of SMI instances, which will be stable but not all men and women might get a
match. To support SMI instances we modify the nextCandidate(l) routine to
return N U LL after the last candidate.
3.2.1 Introduction
The b-marriage problem is similar to the previously described stable marriage
problem in Section 2.3. The difference being that each vertex can be specified to
be matched with any number of neighbours. This is closer to the original college
13
3.2. THE B-MARRIAGE PROBLEM
CHAPTER 3. SEQUENTIAL ALGORITHMS
admissions problem where one college can be assigned multiple students. For the
college admissions problem, b is the number of students the college can admit.
For the students, on the other hand, b is fixed to 1 as one student should not get
admitted to multiple colleges simultaneously. In this case where only one side of
the problem has a b value greater than 1 this can be solved with an algorithm for
the stable-marriage problem by duplicating the vertices for colleges b times (and
any incident edges). If both sides has a b > 1 then the copying technique, applied to
both sides can give a situation where one student is admitted to the same college
multiple times. Instead of dealing with colleges and students, we continue with
men and women. Imagining a situation where we have to find a set of polygamous
stable marriages. Each man and woman can have individual “quotas” on how many
of the other sex he/she wants to be matched with. In our experiments the quotas
for the men and women will be uniform, but individual quotas does not change
the problem. We denote the number of wives for man l as bm (l) and the number
of husbands for woman r as bw (r). We only look at the case where both bm (l) and
bw (r) is > 1 ∀(l, r) ∈ LXR.
14
CHAPTER 3. SEQUENTIAL ALGORITHMS
3.3. MAXIMUM SIZE POPULAR MATCHING
3.2.2 Algorithm
In both the gale-shapley (Section 3.1.2) and the mcvitie-wilson algorithms
(Section 3.1.3) a woman with more than one proposal rejects the worst ranked.
The algorithm for solving b-marriage is similar to these algorithms, but allows a
woman w to have up to bw (w) proposals before she has to reject one. This is done
by introducing a variable pq(w) which returns a max-priority-queue (based on the
ranks of w) capable of storing the bw (r) + 1 suitors for woman w. Initially all
priority queues are empty. When a man proposes to a woman w the proposal is
added to the queue p(w). If the size of w’s queue |pq(w)| ≤ bw (w) the worst ranked
suitor is rejected (by a dequeue operation) and is recursively added to the solution,
trying his next candidate. Observe that matching a man to bm women is the same
as matching bm copies of the man with each their woman, as long as two copies
of the same man is not matched to the same woman. We use this observation
in the algorithm and introduce each man bm times to the solution. Because the
copies of a man share the nextCandidate(l) pointer, which is increasing with each
introduction, we avoid matching two copies of a man to the same woman.
The algorithm terminates when all the copies of the men has been introduced
to the solution and the recursive calls have returned. At this point the contents of
the queues holds the edges of the stable matching. Each woman w is matched is
matched to each suitor in her queue pq(w).
In Algorithm 3 we give a pseudo-code implementation of this algorithm.
15
3.3. MAXIMUM SIZE POPULAR
CHAPTER
MATCHING
3. SEQUENTIAL ALGORITHMS
16
CHAPTER 3. SEQUENTIAL ALGORITHMS
3.3. MAXIMUM SIZE POPULAR MATCHING
on a queue returns the first element from the queue without removing it.
17
Chapter 4
Parallel algorithms
18
CHAPTER 4. PARALLEL ALGORITHMS
4.1. PARALLEL STABLE MARRIAGE
ranking(a,rejected)) z
if (ranking(a,y) <
z
y
ranking(a,rejected))
Q.enqueue(rejected) z
Q.enqueue(rejected) z
suitor(a) = x x
suitor(a) = y y
Figure 4.1: Example of interleaved proposal
One way to mitigate this issue is to lock the threads before updating, one per
19
4.1. PARALLEL STABLE MARRIAGE
CHAPTER 4. PARALLEL ALGORITHMS
woman to ensure that each thread gets exclusive access to this critical section.
As noted in Section 2.1 once a thread has obtained a lock, no other thread can
obtain the same lock before it is unlocked. An alternative method is to use atomic
compare and swap on the value of suitor. An atomic compare and swap fails if
another thread has already changed it from the value of rejected, letting the thread
fetch the new value for suitor(a) and retry the operation.
We opted for a setup where each thread has their own queue of suitors. The
alternative is one shared queue with lock-protected access, but this creates con-
tention as the thread count grows. Our approach reduces the sequential portion
of the program by not having to lock on a shared queue. The downside being that
the lengths of the queues might become unbalanced without any ability to share
work between threads. Note that as the algorithm progresses the men will move
between queues.
In Algorithm 5 we present our pseudo-code for the parallel gale-shapley
algorithm. In Algorithm 5 the parfor operation splits the men in L into |L| p
parti-
tions, where p is the number of threads, iterating the different partitions in different
threads. The men are added to the queue Q, which contains the rejected or not
yet matched men. One important invariant kept is that a man can not exist in
more than one queue at at any time. This invariant is useful as we can progress
the nextCandidate(u) sequence from one thread without worrying about other
threads interfering. Once a potential partner for man u has been found the current
suitor of the partner is swapped with u using the CAS(∗a, b, c) operation. The
CAS(∗a, b, c) operation is the compare-and-swap operation as presented in Sec-
tion 2.1.2, which will only update (atomically) the suitor(partner) value if it still
is the same as opponent. If the value of suitor(partner) was updated since we read
the value the operation will fail as the CAS operation returns another opponent
than the one we previously compared our current suitor against. The current man
u is then retried against the new opponent. If he cannot beat the new opponent
we put the man back on the queue to be retried against a new partner.
The algorithm terminates when all the queues are empty.
20
CHAPTER 4. PARALLEL ALGORITHMS 4.2. PARALLEL B-MARRIAGE
21
4.3. PARALLEL POPULAR MATCH.
CHAPTER 4. PARALLEL ALGORITHMS
method where we first check if a man can win against the current opponent (the
worst ranked man that has proposed), lock and check if that man still is a better
suitor for the partner, and add the man to the priority queue. If the man no longer
is a better suitor he is considered rejected and is immediately recursively retried
with his next partner. The check before locking lets the algorithm avoid locking
for cases where there is no chance of the application being accepted by the college.
Using thread-safe priority-queue implementations would avoid locking here, but we
want our algorithm to not depend on the safety of the queue implementation.
In Algorithm 7 we give our implementation of the parallel b-marriage algo-
rithm. In this implementation we show a new implementation of nextCandidate(l)
which is thread-safe. To implement the nextCandidate we introduce a new variable
candidate(l) which we can atomically fetch and add (FAA) (see Section 2.1.2).
22
CHAPTER 4. PARALLEL ALGORITHMS
4.3. PARALLEL POPULAR MATCH.
23
4.3. PARALLEL POPULAR MATCH.
CHAPTER 4. PARALLEL ALGORITHMS
24
CHAPTER 4. PARALLEL ALGORITHMS
4.3. PARALLEL POPULAR MATCH.
25
4.3. PARALLEL POPULAR MATCH.
CHAPTER 4. PARALLEL ALGORITHMS
Experimental results
In this chapter we present the runtime results of the parallel algorithms presented
in Chapter 4
27
5.2. INSTANCES CHAPTER 5. EXP. RESULTS
For each of the algorithms we first run the sequential version where all
OpenMP/parallelisation-related code has been removed. The sequential version is
optimised for single-thread speed and does not have to care about race conditions
with other threads. The run time measured on the best sequential algorithm is
used as a baseline when calculating the speedup gained from parallelisation.
The run times in the parallel versions are measured using the OpenMP built-in
omp_get_wtime function. The timing includes the allocation and initialisation of
memory for use in the algorithm, but not the loading of input data.
• Native: compiling the program for the mic architecture, logging in to the
linux-os running on the card and then execute the program.
• Offload: using OpenMP pragmas like offload, which transfers the given func-
tion and any dependent data via the PCIe-bus onto the card, resuming exe-
cution on the card.
We primarily used the offload mode because the native mode could make the card
unstable. This would happen if the program was to run out of memory, in which
case the card would hang rebooted.
5.2 Instances
To evaluate the effectiveness of our parallelisation we run our algorithms on a
combination of random graphs and real graphs. We constructed 2 different types of
random graphs: an “easy” random graph with a realtively low number of randomly
28
CHAPTER 5. EXP. RESULTS 5.2. INSTANCES
distributed edges (between log(n) and 2 · log(n)) per vertex and a “hard” instance
where each side had a common ranking and (|M | · |W |) edges.
The easy instances were created by assigning a random number of edges between
log(n) and 2log(n) to each man. For each man the ranks were then shuffled. These
random graphs are considered easy because the average number of proposals to
reach a stable matching is < 5. Table 5.1 shows the easy graphs we generated. The
125M instance was the largest one we could fit in RAM on lyng without resorting
to swapping data out and in from disk. Because of the restrictive memory limit on
the Xeon Phi, the largest graph we could fit was at 20M vertices.
What we call a hard instance is the case where each man share the same prefer-
ence for the women, meaning they rank them equally, and the woman do the same
for the men. We consider this hard because in any stable matching on this instance
type only one suitor would get his best ranked, only one would get his second best
ranked and so on. This means our algorithms considers 1 + 2 + 3 + .. + n edges,
summing up to (n−1)·n
2
edges. Table 5.2 shows the hard instance we generated.
The real world graphs we use are listed in Table 5.3. The graphs were re-
trieved from the SuiteSparse Matrix Collection (formerly known as the University
of Florida Sparse Matrix Collection).[1] For these problems we ran a reduction as
29
5.3. STABLE MARRIAGE RESULTS CHAPTER 5. EXP. RESULTS
30
CHAPTER 5. EXP. RESULTS 5.3. STABLE MARRIAGE RESULTS
100
96 6
94 4
92
Average rank
2
Easy instances Min. number of edges
90
10M 40M60M 100M125M 10M 40M60M 100M125M
Problem Size Problem Size
Figure 5.2: Easy instances solution qual- Figure 5.3: Easy instances average mens
ity rank
The solution quality is expected as the average final rank in the man-optimal
solution is < 5, as showin in Figure 5.3.
31
5.3. STABLE MARRIAGE RESULTS CHAPTER 5. EXP. RESULTS
mcvitie-wilson p = 72
Time (seconds) 80 gale-shapley p = 72
60
40
20
0
10 40 60 75 100 125
N (millions)
32
CHAPTER 5. EXP. RESULTS 5.3. STABLE MARRIAGE RESULTS
20
10
10
1 9 18 27 36 72
Threads
display this dip in performance. On 72 threads there are multiple threads per
processor which lessens the impact of one slower thread. We believe this difference
in speedup, with the gale-shapley algorithm getting a better speedup, is due to
the additional cache available with each extra cpu core.
33
5.3. STABLE MARRIAGE RESULTS CHAPTER 5. EXP. RESULTS
the slower one. We believe this is due to the extra queue operations, which now
outweighs the cache gains from the sequential memory access in the first part, now
being less than 0.3% of the total runtime on the sequential version.
200
Time (seconds)
150
100
50
mcvitie-wilson p = 72
0 gale-shapley p = 72
34
CHAPTER 5. EXP. RESULTS 5.3. STABLE MARRIAGE RESULTS
400K mcvitie-wilson
15 400K gale-shapley
500K mcvitie-wilson
500K gale-shapley
Speedup
10
1
0
1 9 18 27 36 72
Threads
0.6
0.4
0.2
1 9 18 27 36 72
Threads
35
5.3. STABLE MARRIAGE RESULTS CHAPTER 5. EXP. RESULTS
We will now present our results on the graphs obtained from the SuiteSparse Matrix
Collection.
kron_g500-logn21 mcvitie-wilson
0.3 kron_g500-logn21 gale-shapley
Fault_639 mcvitie-wilson
Fault_639 gale-shapley
Time (seconds)
0.1
0
1 9 18 27 36 72
Threads
Figure 5.11: Real instances timing (less is better)
As previously noted these graphs are preprocessed into a stable marriage in-
stance, which includes sorting and expansion of the graph, but the time for this is
not included in our timing. Because of this the time is not really comparable to
other Greedy Matching results. As Figure 5.11 shows, the time it takes to obtain
a stable matching in these graphs is quite small (all instances were solved in under
a second sequentially), and the gain from parallelising is diminishing. Regardless
of that we get a speedup of 4.5 (Figure 5.12) on the kron_g500-logn21 graph.
The time it took to solve the Reuters911 graph was below the resolution of our
measurement.
36
CHAPTER 5. EXP. RESULTS 5.4. B-MARRIAGE RESULTS
10
kron_g500-logn21 mcvitie-wilson
kron_g500-logn21 gale-shapley
Speedup (more is better)
8
Fault_639 mcvitie-wilson
Fault_639 gale-shapley
6 Reuters_911 mcvitie-wilson
Reuters_911 gale-shapley
4
0
1 9 18 27 36 72
Threads
Figure 5.12: Real instances speedup (more is better)
37
5.4. B-MARRIAGE RESULTS CHAPTER 5. EXP. RESULTS
0.7
mcvitie-wilson @ lyng
gale-shapley @ lyng
0.6 mcvitie-wilson @ Xeon Phi
gale-shapley @ Xeon Phi
Time (seconds)
0.5
0.4
0.3
10M 20M
N
Figure 5.13: Best Xeon Phi (240 threads) vs Lyng (72 threads) times on easy
instances (less is better)
500
gale-shapley @ Xeon Phi
mcvitie-wilson @ Xeon Phi
400 gale-shapley @ lyng
mcvitie-wilson @ lyng
Time (seconds)
300
200
100
0
100K 200K 300K 400K
N
Figure 5.14: Best Xeon Phi (240 threads) vs Lyng (72 threads) times on hard
instances instances (less is better)
38
CHAPTER 5. EXP. RESULTS 5.4. B-MARRIAGE RESULTS
10
10M mcvitie-wilson 10M gale-shapley
20M mcvitie-wilson 20M gale-shapley
8
Speedup
2
1
72 80 96 112 128 144 160 176 192 208 224 240
Threads
Figure 5.15: Easy instances speedup on Xeon Phi over gale-shapley on Lyng
(more is better)
39
5.4. B-MARRIAGE RESULTS CHAPTER 5. EXP. RESULTS
b-marriage b = 1 400 b = 3 p = 72
60 mcvitie-wilson b = 5 p = 72
Time (seconds)
Time (seconds)
300
40
200
20
100
0 0
10M 20M 40M 50M 60M 10M30M50M 75M 100M125M
N N
Figure 5.16: b-marriage (LS) b = 1 Figure 5.17: b-marriage (heap) best
vs mcvitie-wilson runtime (less is bet- runtime (less is better)
ter)
when b = 3. This is likely due to there being 53 times more matches to be made.
Figure 5.18 shows speedup on easy instances (with heap as queue implemen-
tation) with b = 3 and b = 5. Here we see that that the parallel algorithm gives
speedups of up to 41x on b = 5 and up to 27.8x on b = 3. We think the increase in
speedup with the increased value of b is caused by the parallel algorithm spending
proportionally less time in a critical section.
In our tests the atomic operation did better than locking and unlocking with
OpenMP lock structures.
Recall from Algorithm 7 that if the queue is not full it does not dequeue the
worst candidate. When we increase the value of b, we increase the amount of
matches being made without rejecting an existing proposal, and therefore spend
proportionally less time with a lock. Depending on the implementation of the
queue this operation might be expensive. The results presented Figure 5.18 were
done with a heap, which requires O(log(b)) operations for each enqueue/dequeue
operation.
In Figure 5.19 we run the same algorithm where the heap is replaced with a
linear search through an array. The linear search requires O(b) operations for both
enqueue and the enqueue/dequeue operation, but has a much smaller overhead on
small values of b. The difference in time spent locked in the two execution paths
is visible in the plot, making the difference in speedup between the two values of b
smaller. This difference in efficiency is also evident in the effiency plot presented
40
CHAPTER 5. EXP. RESULTS 5.4. B-MARRIAGE RESULTS
40
30
Speedup
20
10 75M b = 3 75M b = 5
100M b = 3 100M b = 5
10 125M b = 3 125M b = 5
1 9 18 27 36 72
Threads
Figure 5.18: Easy instances speedup with heap variant (more is better)
30
25
20
Speedup
15
10
75M b = 3 75M b = 5
5
100M b = 3 100M b = 5
10 125M b = 3 125M b = 5
1 9 18 27 36 72
Threads
Figure 5.19: Easy instances speedup with linear search variant (more is better)
41
5.5. POPULAR MATCHING RESULTS CHAPTER 5. EXP. RESULTS
1
Efficiency
0.8
0.6
0.4
1 9 18 27 36 72
Threads
Figure 5.20: Easy instances efficiency comparison (T (1)/T (p)/p, more is better)
in Figure 5.20, where the efficiency of the LS variant is less dependent on b than
the heap variant.
For the hard instances we experienced more variance in the speedup, as shown
in Figure 5.21. For these results we did not have exclusive access to the machine,
but they serve as an indication of what we can expect without competition.
42
CHAPTER 5. EXP. RESULTS 5.5. POPULAR MATCHING RESULTS
40
30
Speedup
20
10
100K b = 5 200K b = 5
10 300K b = 5 400K b = 5
1 9 18 27 36 72
Threads
Figure 5.21: Hard instances speedup (more is better)
Parallel huang-kavitha
4 Parallel mcvitie-wilson
Time (seconds)
0
10M 20M 40M 50M 60M 75M 100M 125M
N
43
5.5. POPULAR MATCHING RESULTS CHAPTER 5. EXP. RESULTS
25
20
Speedup
15
10
44
Chapter 6
Conclusion
In this chapter we will first give a summary of the thesis, a short conclusion based
on our experimental results, and then some thoughts about further improvements.
In the first three chapters we introduced the notion of matching, parallelising
and preference. We then gave implementations to three different problems for
matching under preference: the problem of stable marriage, the closely related
b-marriage and popular matching problem. In chapter 4 we first presented some
previous work on parallel algorithms for the stable marriage problem. We intro-
duced our parallel algorithm for the previously presented problems based on ideas
from the parallel suitor and b-Suitor algorithms. In chapter 5 we presented
runtime results for both the sequential and parallel algorithms showing scaling on
all but the smallest instances.
The experimental results show that the ideas presented in the Suitor and
b-Suitor transfer well to the parallel algorithms for matching under preferences
presented in this paper.
45
Bibliography
[1] T. A. Davis and Y. Hu. The university of Florida sparse matrix collection.
ACM Transactions on Mathematical Software, 38(1):1 – 25, 2011.
[2] D. Gale and L. S. Shapley. College admissions and the stability of marriage.
The American Mathematical Monthly, 69(1):9–15, 1962.
[3] David Gale and Marilda Sotomayor. Some remarks on the stable matching
problem. Discrete Applied Mathematics, 11(3):223 – 232, 1985.
[4] Chien-Chung Huang and Telikepalli Kavitha. Popular matchings in the sta-
ble marriage problem. Information and Computation, 222:180 – 194, 2013.
38th International Colloquium on Automata, Languages and Programming
(ICALP 2011).
[6] Arif Khan, Alex Pothen, Md. Mostofa Ali Patwary, Nadathur Rajagopalan
Satish, Narayanan Sundaram, Fredrik Manne, Mahantesh Halappanavar,
and Pradeep Dubey. Efficient approximation algorithms for weighted $b$-
matching. SIAM Journal on Scientific Computing, 38(5):S593–S619, 2016.
[7] Jesper Larsen, Claus C. Carøe, and David Pisinger. A parallel approach to
the stable marriage problem. pages 277–287, 1997.
[9] Fredrik Manne and Rob H. Bisseling. A Parallel Approximation Algorithm for
the Weighted Maximum Matching Problem, pages 708–717. Springer Berlin
Heidelberg, Berlin, Heidelberg, 2008.
46
BIBLIOGRAPHY BIBLIOGRAPHY
[11] H. Schweizer, M. Besta, and T. Hoefler. Evaluating the cost of atomic opera-
tions on modern architectures. In 2015 International Conference on Parallel
Architecture and Compilation (PACT), pages 445–456, Oct 2015.
[12] S. S. Tseng and R. C. T. Lee. A parallel algorithm to solve the stable marriage
problem. BIT Numerical Mathematics, 24(3):308–316, 1984.
47
Appendix A
Appendix
48
On Stable Marriages and Greedy Matchings
Fredrik Manne∗ , Md. Naim∗ , Håkon Lerring∗, and Mahantesh Halappanavar†
Research on stable marriage problems has a long and ing greedy weighted matchings, and algorithms for solving
mathematically rigorous history, while that of exploiting stable marriage problems. Although there exist exact algo-
greedy matchings in combinatorial scientific computing is rithms for solving various weighted matching problems these
a younger and less developed research field. In this paper tend to have running times that typically involve the product
we consider the relationships between these two areas. In of the number of vertices and the number of edges. As large
particular we show that several problems related to comput- graph instances can contain tens of millions of vertices and
ing greedy matchings can be formulated as stable marriage billions of edges it is clear that such algorithms can easily
problems and as a consequence several recently proposed al- become infeasible. For this reason there has been a strong in-
gorithms for computing greedy matchings are in fact special terest in developing fast approximation algorithms and also
cases of well known algorithms for the stable marriage prob- in parallelizing these, see [19] and the references therein. Al-
lem. though such algorithms typically only guarantee an approx-
However, in terms of implementations and practical imation factor of 0.5 compared to the optimal one, practical
scalable solutions on modern hardware, work on computing experiments have shown that they are very often only within
greedy matchings has made considerable progress. We a few percent from optimal. One such algorithm is the clas-
show that due to the strong relationship between these two sical greedy algorithm applied to an edge weighted graph.
fields many of these results are also applicable for solving Here edges are considered by decreasing weight and an edge
stable marriage problems. This is further demonstrated by is included in the matching if it is not adjacent to an already
designing and testing efficient multicore as well as GPU included edge.
algorithms for the stable marriage problem. The main contributions of this paper are as follows. Ini-
tially we consider implementation issues when designing ef-
1 Introduction ficient algorithms for the stable marriage problem. Next,
In 1962 Gale and Shapley formally defined the stable mar- we show that several recently published algorithms for com-
riage problem and gave their classical algorithm for its so- puting greedy matchings are in fact special cases of classi-
lution [6]. Since then this field has grown tremendously cal algorithms for stable marriage problems. This also in-
with numerous applications both in mathematics and in eco- cludes a generalization of the matching problem known as
nomics. For a recent overview see the book by Manlowe b-matching. Here each vertex can be matched with several
[17]. Graph matching is a related area where the object is other vertices in the final solution. Due to the strong similari-
also to find pairs of entities satisfying various optimality cri- ties between the stable marriage problem and greedy match-
teria. These problems find a large number of applications. ing, we show that one can apply recent results on design-
For an overview of some of these motivated from combina- ing scalable greedy matching algorithms to the computation
torial scientific computing see [24]. of stable marriage solutions. This is verified by presenting
While research on stable marriage problems has mainly efficient parallel implementations of various types of Gale-
focused on theory and mathematical rigor, work on graph Shapley type algorithms for both multithreaded computers
matching in combinatorial scientific applications has had as well as for GPUs.
a larger practical component concerned with implementing The remainder of the paper is organized as follows.
and testing code on various computer architectures with the In Section 2 we review the Gale-Shapley algorithm and
intent of developing fast scalable algorithms. consider implementation issues related to this. Next, in
In this paper we investigate the connection between Section 3 we show that the computation of a greedy matching
one type of matching problems, namely those of comput- can be reformulated as a stable marriage problem. This is
also shown to be true for greedy b-matching. In Section 4
we give parallel implementations of the Gale-Shapley and
∗ Department of Informatics, University of Bergen, N-5020 Bergen,
McVitie-Wilson algorithms and show their scalability, before
Norway. Email: {fredrikm,naim}@ii.uib.no, [email protected]
† Pacific Northwest National Laboratory, 902 Battelle Boulevard, concluding in Section 5.
P.O.Box 999, MSIN J4-30, Richland, WA 99352, USA. Email: ma-
[email protected]
2 The Stable Marriage Problem round. The running time of such a scheme is Θ(n2 ) even for
In the following we describe the stable marriage (SM) prob- an instance of SMI.
lem and how it can be solved using the Gale-Shapley algo- If one is willing to forgo the requirement that the pro-
rithm. We also consider some implementation issues related posals in each round must be made in the same relative or-
to the algorithm. Finally, we review some generalizations of der then it is not hard to design an implementation of the
the SM problem. Gale-Shapley algorithm with running time proportional to
The SM problem is defined as follows. Let L and R the number of actual proposals made. To do this one can
be two equal sized sets L = {l1 , l2 , . . . , ln } and R = maintain a queue Q of men waiting to make their proposals.
{r1 , r2 , . . . , rn }. The entries in L are typically referred to Initially Q = L and in each step of the algorithm the man
as “men”, while the entries in R are referred to as “women”. at the front of the queue gets to propose to his current best
Every man and woman has a total ranking of all the members candidate rj ∈ R, and any rejected li is inserted at the end
of the opposite sex. These rankings give the “desirability” of the queue. This will ensure that all men rejected in round
for each participant to match with a participant in the other t gets to propose before any man rejected in round t + 1, but
set. The object is now to find a complete matching M (i.e. a the relative order among the men might not always be the
paring) between the entries in L and R such that no two same. The algorithm terminates when the queue is empty.
li ∈ L and rj ∈ R both would obtain a higher ranked One simple enhancement one can make to the Gale-
partner if they were to abandon their current partner in M Shapley algorithm is that no li ∈ L should propose to an
and rematch with each other. Any solution satisfying this rj ∈ R who already has a proposal from someone whom rj
condition is stable. ranks higher than li , as such a proposal will be rejected. Thus
Gale and Shapley [6] defined the stable marriage prob- each li should propose to his most preferred rj where li has
lem and also proposed the first algorithm for solving it. The not already been rejected and where rj ranks li higher than
algorithm operates in rounds as follows. In the first round her current best proposal (if any). In terms of implementation
each man in L proposes to his most preferred woman in this means that it is sufficient to only maintain the current
R. Each woman will then reject all proposals except the best proposal for each rj . When the algorithm terminates
one she has ranked highest. In subsequent rounds each man these proposals will make up the solution. We give our
that was rejected in the previous round will again propose to complete implementation of the Gale-Shapley algorithm in
the woman which he has ranked highest, but now disregard- Algorithm 1.
ing any woman that he has already proposed to in previous In Algorithm 1 each rj has a variable suitor(rj )
rounds. Gale and Shapley showed that this process will ter- initialized to N U LL that holds her current best pro-
minate with each man in L being matched to a woman in R posal. Similarly, ranking(rj , li ) returns rj ’s ranking of
and that this solution is stable. Although an SM instance can li (as a number in the range 1 through n). We define
have many stable solutions, the Gale-Shapley algorithm will ranking(rj , N U LL) = n + 1 to ensure that any proposal
always produce the same one. is better than no proposal. The function nextCandidate(li )
An important variant of this problem is when each par- will initially return li ’s highest ranked woman and then for
ticipant has only ranked a subset of the opposing partici- successive calls return the next highest ranked one following
pants. This problem is known as the stable marriage prob- the last one retrieved.
lem with incomplete lists (SMI). Any solution M to an SMI For an SM instance it is straight forward to precompute
instance must then in addition to being stable, also consists the values of ranking() in O(n2 ) time. However, for
of mutually ranked pairs (li , rj ). The SMI problem is solved an SMI instance maintaining a complete ranking() table
by the Gale-Shapley algorithm but the solution might not be would require O(n2 ) space and also proportional time to
complete leaving some participants unmarried [6]. initialize it. In this case it is more efficient to store the value
There exists a number of variants of the SM problem, of ranking(ri , lj ) together with ri in lj ’s ranking list so that
for two comprehensive surveys see the books [10, 17]. In the it can be fetched in O(1) time when needed. These values
following we will only consider the classical SM and SMI can be precomputed in time proportional to the sum of the
problems. lengths of the ranking lists. To do this one first traverses
The original Gale-Shapley algorithm is described as the women’s lists building up lists for each man lj with the
operating in rounds, where only the men who were rejected women that have ranked him along with in what position.
in round t will propose in round t + 1. It is not stated in Then using an array position() of length n initially set to
which order the proposals in a round should be made or what 0, the list of each man lj is processed as follows. For each
kind of data structures to use. If one traverses the men in L woman ri that has ranked lj we store the value lj along with
in their original order in each round and lets each rejected in what position ri has ranked lj in position(ri ). We next
man propose once it is discovered, then this will ensure that traverse lj ’s priority list and for each ri in the list we look
the men always propose in the same relative order in each up position(ri ) and see if it contains lj . If so, we fetch ri ’s
ranking of lj and store it together with lj ’s ranking of ri . At preferences, if the men’s preferences are random, then the
the same time any ri that has not ranked lj but which lj has expected sum of men’s rankings of their mates as assigned
ranked can be purged from the priority list of lj . by the Gale-Shapley algorithm is bounded above by n(1 +
1/2 + ... + 1/n). Knoblauch [16] showed that this is also
Algorithm 1 The Gale-Shapley algorithm using a queue an approximate lower bound in the sense that the ratio of
1: Q = L the expected sum of men’s rankings of their assigned mates
2: while Q 6= ∅ do and (n + 1)((1 + 1/2 + ... + 1/n) − n) has limit 1 as
3: u = Q.dequeue() n goes to ∞. Thus if the men’s preferences are random
4: partner = nextCandidate(u) then this sum is Θ(n ln n) for large n. However, it is not
5: while ranking(partner, u) > hard to design instances where this sum is Θ(n2 ). This is
ranking(partner, suitor(partner)) do for instance true for any SM instance where the men have
6: partner = nextCandidate(u) identical preferences.
7: if suitor(partner) 6= N U LL then
8: Q.enqueue(suitor(partner)) 2.1 Generalizations of SM We next review two general-
9: suitor(partner) = u izations of the SM problem. In the stable roommates (SR)
problem one is given a set of n persons, each one with a
complete ranking of all the others persons. The objective
McVitie and Wilson [5] gave a recursive implementation
is now to pair two and two persons together, i.e. make them
of the Gale-Shapley algorithm. This algorithm also iterates
roommates, such that there is no pair (x, y) of persons where
over the men, allowing each one to make a proposal to his
x is either unmatched or prefers y to its current partner, while
most preferred woman. But if this proposal is rejected or if it
at the same time y is either unmatched or prefers x to its cur-
results in an existing suitor being rejected then the algorithm
rent partner. Just like for the SM problem, such a solution
recursively lets the just rejected man make a new proposal to
is called stable. Unlike the SM problem there might not ex-
his best remaining candidate. The recursion continues until
ist a solution for an SR instance. We also note that if some
a proposal is made such that no man is rejected (because the
persons have only ranked a subset of the other participants
last proposed to woman did not already have a suitor). At
we get the stable roommates problem with incomplete lists
this point the algorithm will continue with the outer loop and
(SRI). Irving gave an algorithm for computing a stable solu-
process the next man. When all men have been processed
tion to an SRI problem or to determine that no such solution
the algorithm is finished. It is shown in [5] that the McVitie-
exists [12]. This algorithm operates in two stages, where
Wilson algorithm produces the same solution as the Gale-
the first one is similar to the Gale-Shapley algorithm where
Shapley algorithm. We note that similarly to the Gale-
each person makes, accepts and rejects proposals. The sec-
Shapley algorithm it is possible to speed up the algorithm by
ond phase of the algorithm is slightly more involved but does
avoiding proposals that are destined to be rejected because
not change the running time of O(n2 ). For more information
the proposed to woman already has a better offer.
on the SR and SRI problems see [10, 17].
Comparing the two algorithms each man will consider
In the last generalization each person can be matched
exactly the same women before ending up with his final
with more than one person in a final solution. More formally,
partner. The only difference is the order in which this is
we are looking for a stable solution to an SM, SMI, SR, or
done. While the Gale-Shapley algorithm will maintain a
SRI instance where each person vi is matched with at most
list of men that needs to be matched, the McVitie-Wilson
b(vi ) other persons, where b(vi ) ≥ 1. Being stable again
algorithm will always maintain a solution where each man
means that no two persons vi and vj would both obtain a
considered so far is matched before including a new man in
better solution if they were to match with each other, either
the solution.
by dropping one of their current partners or if vi has fewer
We note that one can implement a non-recursive ver-
than b(vi ) partners or if vj has fewer than b(vj ) partners.
sion of the McVitie-Wilson algorithm simply by replacing
For the SM problem this gives us the many-to-many
the queue Q in Algorithm 1 by a stack and replacing the
stable assignment problem (MMSA), where each “man” and
dequeue() and enqueue() operations with pop() and push()
“woman” can be matched with several participants of the
operations respectively. To see that this will result in the
opposite sex. This was solved by Baı̈ou and Balinski [2]
McVitie-Wilson algorithm it is sufficient to first note that the
who presented a general algorithm based on modelling this
initial placement of L in Q is equivalent to an outer loop
as a graph searching problem. See also [3] for how to model
that processes each man once. Any rejected man will then
the SM problem as a graph problem [3].
be placed at the top of the stack and therefore be processed
Applying the last generalization to an SR instance,
immediately, similarly to a recursive call in the original al-
we get the stable fixtures problem [13] for which Irving
gorithm.
and Scott gave an O(n2 ) algorithm. Similarly to Irving’s
Wilson [29] showed that for any profile of womens
algorithm for SRI, this algorithm also consists of two stages, increase if (u, v) was added to M while removing any edges
where the first stage is a natural extension of the Gale- incident on either u or v from M . This observation is often
Shapley algorithm to handle that each person can participate stated as that the solution given by G REEDY does not contain
in multiple matches. any augmenting path containing three or fewer edges. An
augmenting path of length k is a path containing k edges
3 Matching Problems starting with an edge in M and then alternating between
We next explore the relationship between stable marriages edges not in M and edges in M , with the property that if
and weighted matchings in graphs. A matching M on a one was to replace all the edges on the path that belong to M
graph G(V, E) is a subset of the edges such that no two with those that are not in M then the weight of the solution
edges in M share a common end point. For an unweighted would increase.
graph the object is to compute a matching of maximum We next proceed to show that the solution given by
cardinality. For an edge weighted graph a typical problem is G REEDY can also be obtained by solving an associated SMI
to compute a matching M such that the sum of the weights (or SM) instance. To the best of our knowledge this result
of the edges in M is maximum over all matchings. Another has not been shown previously.
variant could be to compute the maximum weight matching Given an instance of the GM problem on an edge
over all matchings of maximum cardinality. weighted graph G(V, E). We define an SMI instance G0
We consider the G REEDY algorithm for computing a from G as follows. Let L and R be the sets of men and
matching of maximum weight in an edge weighted graph women respectively, both of size n = |V |. Any man li
where all weights are positive. This algorithm considers will include exactly those rj in its ranking where there is
edges by decreasing weight. In each step the heaviest re- an edge (vi , vj ) ∈ E. As the edges in G are not directed,
maining edge (u, v) is included in the matching before re- this also means that lj will rank ri . Similarly, any woman
moving any edge incident on either u or v. If the weights rj will include exactly those li in her ranking where there is
of the edges in G are unique or if a consistent tie break- an edge (vi , vj ) ∈ E. Both men and women order their lists
ing scheme is used then it follows that the solution given by decreasing weight of the corresponding edges in G. Thus
by G REEDY is also unique. In the following we will al- every (vi , vj ) ∈ E gives rise to four rankings in G0 . We call
ways assume that this is the case. It is well known that the two pairs (li , rj ) and (lj , ri ) for the corresponding pairs
G REEDY guarantees a solution of weight no worse than 0.5 of (vi , vj ).
times the weight of an optimal solution. We label the prob-
L EMMA 3.1. Given a graph G with SMI instance G0 as
lem of computing a greedy matching in an edge weighted
described above. Let further M be the greedy matching on
graph as the GM problem.
G. Then the pairs in G0 corresponding to the edges in M
Given an instance G of the GM problem one can con-
make up the unique solution to the SMI problem on G0 .
struct an equivalent instance of the SRI problem by sorting
the edges incident on each vertex u by decreasing weight, Proof. The proof is by induction on the edges of M con-
and letting this be the ranking of u’s neighbors in the SRI sidered by decreasing weight. Let (vi , vj ) be the edge of
instance. It is not hard to see that with this construction, a maximum weight in G. Then (vi , vj ) ∈ M and it also fol-
solution to the GM problem is equivalent to a stable solution lows from the construction of G0 that li will rank rj highest.
of the corresponding SRI problem. Consider the heaviest Similarly, li will also have the highest ranking among the
edge (u, v) in the graph. This is included in the GM solution men ranked by rj . Thus (li , rj ) must be included in any sta-
and the corresponding vertex pair must also be part of any ble solution of G0 . A similar argument shows that the edge
solution to the SRI instance, otherwise this solution would (lj , ri ) will also be included in such a solution.
not be stable as both u and v would prefer to match with Assume now that the pairs in G0 corresponding to the
each other over any other partner. We can thus include (u, v) k ≥ 1 heaviest edges in M must be included in any
in the solutions to both instances and also remove u and v stable solution and consider the two pairs (ls , rt ) and (lt , rs )
from further consideration. For the GM problem this means corresponding to the k + 1st heaviest edge (vs , vt ) in M .
that any edges incident on either u or v are removed and It is clear that any solution where ls is matched to a
for the SRI instance u and v are removed from all ranking woman that he has ranked after rt while at the same time
lists. One can then repeat the argument using the heaviest rt is matched to a man that she has ranked after ls , cannot
remaining edge, and it thus follows by induction that the two be stable as both ls and rt would be better of if they were to
solutions are identical. It is also clear that the corresponding match with each other. Thus if (ls , rt ) is not included in a
SRI instance always has a unique stable solution. stable solution at least one of ls and rt must be matched to a
The above construction implies that the solution given partner which he or she has ranked higher than the other one
by G REEDY is stable in the sense that there does not exist of {ls , rt }. Assume therefore that ls is matched to ru and
an edge (u, v) 6∈ M such that the weight of M would that ls has ranked ru higher than rt , implying that the weight
of (vs , vu ) is greater than the weight of (vs , vt ) in G. But stance. Also let {rs1 , rs2 , . . . , rsf } and {lt1 , lt2 , . . . , ltg } be
since (vs , vt ) ∈ M it follows that (vs , vu ) 6∈ M . Thus vu the ranked lists of the man li and the woman ri respectively.
must be matched to some other vertex vz in M . And since Then if follows from the construction of G0 that f = g and
(vs , vu ) 6∈ M the weight of (vu , vz ) must be greater than that that sk = tk for all k. Thus any proposal made to ri could
of (vs , vu ). By the induction hypothesis the pairs in G0 that be handled directly by li as he has the same information as
correspond to the k heaviest edges in M must be included ri . It follows that one can merge li and ri into one node vi
in any stable solution in G0 . It therefore follows that lz must that handles making, accepting, and rejecting proposals re-
be matched to ru in any stable solution on G0 contradicting lated to li and ri . In this way both the Gale-Shapley and
that ls is matched to ru . A similar argument shows that rt the McVitie-Wilson algorithm can be used directly on edge
cannot be matched to any man in L to which she gives higher weighted general graphs to compute greedy matchings, but
priority than she gives to ls . Thus the pair (ls , rt ) must be in now using edge weights to rank potential partners.
any stable solution in G0 . The argument for why (lt , rs ) also Irving’s algorithm [12] for solving the SRI problem
must be included in a stable solution is analogous. It follows consists of two stages, of which the first is exactly this
that any pair in G0 corresponding to an edge in M must be algorithm of making, accepting, and rejecting proposals on
part of a stable marriage in G0 . a general graph. If the rankings in an SRI instance are based
It only remains to show that once the pairs correspond- on edge weights from a GM instance then the first phase will
ing to the edges in M have been included in the solution M 0 produce the greedy solution which is stable, thus making the
to G0 , then it is not possible to match any other pairs in G0 . second phase of the algorithm redundant.
If M 0 contains a pair (li , rj ) in addition to the pairs corre- Previous efforts at designing fast parallel greedy match-
sponding to the edges in M then (vi , vj ) 6∈ M and neither vi ing algorithms have been based on the notion of dominant
nor vj can be matched in M . But since li has ranked rj (and edges. These are edges that are heavier than any of their
vice versa) it follows that (vi , vj ) ∈ E and that M can be neighboring edges. Preis showed that an algorithm based on
expanded with (vi , vj ). This contradicts that M is maximal repeatedly including dominant edges in the matching while
and the result follows. removing any edges incident on these will result in the same
solution as G REEDY [25]. Based on this observation Manne
We next consider the b-matching problem which is a
and Bisseling developed the pointer algorithm [18], which
generalization of the regular weighted matching problem
was further enhanced by Manne and Halappanavar in the
similar to the many-to-many stable assignment problem and
S UITOR algorithm [19]. We note that the S UITOR algo-
the stable fixtures problem. A b-matching on G is a subset
rithm is identical to the McVitie-Wilson algorithm applied to
of edges M ⊆ E such that every vertex vi ∈ V has at most
a general edge weighted graph, while the pointer algorithm
b(vi ) edges in M incident on it. The objective is to compute
has strong resemblances to the Gale-Shapley algorithm as
the b-matching of maximum weight. A 0.5 approximation
outlined in Algorithm 1.
can again be computed using the greedy algorithm that
The same type of relationship also holds true between
selects edges by decreasing weight and whenever b(vi ) edges
the greedy b-matching problem and the many-to-many sta-
incident on vi have been selected, the remaining edges
ble assignment problem. The algorithm presented in [2] can
incident on vi are removed [20]. Setting b(vi ) = 1 for all
be instantiated to solve the b-matching problem using a Gale-
vi ∈ V gives a regular (one) matching. See [21] for a survey
Shapley type algorithm where a vertex v will accept the b(v)
of algorithms for computing optimal b-matchings.
best offers at any given time. We note that this is the same
It is straight forward to see that the stable fixtures prob-
algorithm as the one presented by Georgiadis and Papatri-
lem is also a generalization of greedy b-matching. Given
antafilou [8] and also by Khan et al. [14] for computing a
an instance of the greedy b-matching problem, one can also
greedy b-matching. In [14] the authors experiment with what
construct an equivalent many-to-many stable assignment in-
they call delayed versus eager rematching of rejected suit-
stance by setting the bounds b(li ) and b(ri ) equal to b(vi ).
ors. The difference between these two variants is the same
A proof similar to that of Lemma 3.1 shows that these two
as that between a Gale-Shapley and a McVitie-Wilson style
problems have equivalent solutions.
algorithm.
3.1 Algorithmic Similarities As a consequence of the
4 Experiments
fact that the solution given by G REEDY can be obtained by
either solving a properly designed instance of SMI or SRI, As shown in Section 3 much of the theory that has been
any algorithm that solves either of these two problems can developed lately for greedy matching algorithms are mainly
also be used to compute a greedy weighted matching. This restricted versions of previous results from the theory of
process can be simplified as it might be possible to run an stable marriages. However, the work on greedy matchings
SMI or SRI algorithm directly on the original graph. Let has to a large extent been driven by a need for developing
G be an instance of GM and G0 its corresponding SMI in- scalable parallel algorithms for use in scientific applications.
This has lead to the implementation of Gale-Shapley and to the threads before moving on to the next round. However,
McVitie-Wilson type matching algorithms on a large variety synchronization tends to be costly, and experiments done on
of architectures, including distributed memory machines greedy matching problems indicate that this is typically not
[4, 18], multicore computers [11, 15, 19], and GPUs [1, 23]. worth the effort. For the McVitie-Wilson algorithm one can
There has been less emphasis on implementations and load balance the algorithm by using one of the dynamic load
developing working code for the stable marriages problem. balancing strategies in OpenMP such as dynamic or guided
We believe that much of the work done on greedy matchings in the initial assignment of men to threads. This strategy was
can easily carry over to developing efficient code for stable used successfully in experiments for the S UITOR matching
marriage problems. To show the feasibility of this we have algorithm on graphs with highly varying vertex degrees [19].
developed shared memory implementations of both the Gale- For the CUDA algorithm we assign one thread to each
Shapley and McVitie-Wilson algorithms. We used OpenMP man in our implementation of the McVitie-Wilson algorithm.
to parallelize the Gale-Shapley algorithm and both OpenMP Each thread then executes the algorithm similarly to the
and CUDA for parallelizing the McVitie-Wilson algorithm. OpenMP version using a CAS operation to assign a man
In weighted matching both endpoints of an edge (u, v) as the suitor of a particular woman. Using only one thread
evaluates the importance of the edge to the same number, per man allows for a larger number of thread blocks which
i.e. the weight of the edge. Whereas in the stable marriage the runtime environment can balance across the device. But
problem both u and v assign their own ranking of the as the threads within one physical warp operate in SIMD,
other. Thus the main difference between greedy matching the running time of all threads in the same warp will be
algorithms such as those presented in [18, 19] and the Gale- equal to the maximum execution time of any of the threads.
Shapley algorithm is that in the latter, a man who makes Similarly, the threads within the same thread block will not
a proposal evaluates his chance of success based on the release resources until all threads in the block have finished
woman’s ranking, instead of on a common value. Another executing. It would have been possible to statically assign
difference is that it is typically not assumed in weighted multiple men to each thread or to design a dynamic load
matching problems that the neighbor list of a vertex is sorted balancing scheme with the aim of evening out the work
by decreasing weight. It was shown in [19] that when this is load. But this would have resulted in a more complicated
the case, then it both simplifies the algorithm and also speeds algorithm and as our main goal is proof of concept we did
up the execution considerably. not pursue this.
Our parallelization strategy for the McVitie-Wilson Implementing the Gale-Shapley algorithm on the GPU
algorithm using OpenMP closely follows that of the presents additional challenges compared to the McVitie-
S UITOR algorithm as presented in [19], while our CUDA Wilson algorithm. In a Gale-Shapley algorithm the threads
version of the same algorithm is a simplified version of the would have to be grouped so that each thread group operates
S UITOR algorithm used in [23]. In both of our OpenMP on one common queue, where the size of the group could
algorithms the set of men is initially partitioned among the be either a subset of threads in a warp or all the threads
threads who then each run a local version of the correspond- in one thread block. As the number of free men monoton-
ing algorithm until completion. A thread will first search the ically decreases between rounds, there should initially be
list of the current man to locate the woman he gives highest more men than threads assigned to the same queue, some-
priority and where the woman also prefers him to her current thing that would complicate the algorithm. Also, having sev-
suitor (if any). If such a woman is discovered the thread will eral threads operate on one common queue would require
use a compare-and-swap (CAS) operation to become the new synchronization which can be time consuming on the GPU.
suitor of the woman. In this way it is assured that no other For this reason we chose not to implement the Gale-Shapley
thread has changed the suitor value. If the CAS operation algorithm using CUDA.
succeeds the previous suitor (if any) is treated according to As we are not aware of any sufficiently large publicly
the current strategy and is inserted in a local stack (McVitie- available data sets for the stable marriage problem, we have
Wilson) or a local queue (Gale-Shapley). If the CAS oper- designed two different random data sets. The first set has
ation failed because some other thread had already changed been constructed to be relatively easy to solve, whereas the
the suitor value, then if the current man can still beat the new second set is intended to be more time consuming. We
the suitor then the thread will retry with a new CAS opera- label these sets as easy and hard respectively. Each instance
tion, otherwise it will continue searching for the nest eligible consists of n men and n women. In the easy data set
woman. each man is assigned a random number ∈ [0, 1] and
There is a difference between the algorithms in how they then randomly picks and ranks (1 + ) ln n women. Each
can handle load imbalance. For the parallel Gale-Shapley woman then ranks exactly the men that ranks her. With
algorithm it is possible to synchronize the threads after each this configuration more than 98% of the participants were
round of proposals and then redistribute the unmarried men matched in every final solution and the total number of
proposals is at most 2n ln n with an average of n ln n. In
the hard data set each man has an identical complete random
ranking of all the women. Similarly, all woman share 1.2 OpenMP: McVitie-Wilson
the same random ranking of all the men. Thus there will
OpenMP: Gale-Shapley
GPU: McVitie-Wilson
Time (seconds)
Moreover, in the hard instances there will be contention 0.6
among the men for obtaining the same set of women, and
thus cause substantial synchronization requirements for the 0.4
200
3
Time (seconds)
Time (seconds)
150
2
100
1 50
0 0
5M 25M 50M 75M 100M 125M 100K 200K 300K 400K 500K
N N
400K McVitie-Wilson
400K Gale-Shapley
16 75M McVitie-Wilson 20 500K McVitie-Wilson
75M Gale-Shapley 500K Gale-Shapley
14 100M McVitie-Wilson
100M Gale-Shapley
125M McVitie-Wilson 15
12
125M Gale-Shapley
Speedup
10
Speedup
10
8
6
5
2 0
1 9 18 27 36
0
Threads
1 9 18 27 36
Threads
120M
100M McVitie-Wilson time of the Gale-Shapley algorithm is linear in the instance
100M Gale-Shapley
125M McVitie-Wilson size. Thus moderate sized problem can already be solved
125M Gale-Shapley
rapidly. In addition, our current experiments on the SMI
100M
80M
1.5B
1 9 18 27 36
References
Threads
Figure 8: TEPS 1for hard datasets [1] B. O. Fagginger Auer and R. H. Bisseling. A GPU algo-
rithm for greedy graph matching. In Facing the multicore,
Challenge II, volume 7174, pages 108–119. LNCS, Springer,
5 Conclusion 2012.
[2] M. Baı̈ou and M. Balinski. Many-to-many matching: stable
In his book Manlove [17] lists some of the most noteworthy
polyandrous polygamy (or polygamous polyandry). Discrete
open problems related to SM. One of these is to determine
Applied Mathematics, 101(1-3):1–12, 2000.
if the SM problem is in the complexity class NC or not, [3] M. Balinski and G. Ratier. Of stable marriages and graphs,
that is, to determine whether the problem can be solved by and strategy and polytopes. SIAM Review, 39(4):575–604,
an algorithm with polylogarithmic running time when using 1997.
a polynomial number of processes. Efforts at designing [4] Ü. V. Çatalyürek, F. Dobrian, A. H. Gebremedhin, M. Ha-
such algorithms has mainly resulted in parallel algorithms lappanavar, and A. Pothen. Distributed-memory parallel al-
gorithms for matching and coloring. In IPDPS Workshops, [23] Md. Naim, F. Manne, M. Halappanavar, A. Tumeo, and
pages 1971–1980, 2011. J. Langguth. Optimizing approximate weighted matching on
[5] L. B. Wilson D. G. McVitie. The stable marriage problem. nvidia kepler K40. In 22nd IEEE International Conference on
Communications of the ACM, 14(7):486–490, 1971. High Performance Computing, HiPC 2015, Bengaluru, India,
[6] L. S. Shapley D. Gale. College admissions and the stability of December 16-19, 2015, pages 105–114, 2015.
marriage. The American Mathematical Monthly, 69(1):9–15, [24] U. Naumann and O. Schenk. Combinatorial Scientific Com-
1962. puting. CRC Press, 2012.
[7] T. Feder, N. Megiddo, and S. A. Plotkin. A sublinear parallel [25] R. Preis. Linear time 1/2-approximation algorithm for max-
algorithm for stable matching. Theor. Comput. Sci., 233(1- imum weighted matching in general graphs. In STACS’99,
2):297–308, 2000. volume 1563, pages 259–269. LNCS, Springer, 1999.
[8] G. Georgiadis and M. Papatriantafilou. Overlays with pref- [26] M. J. Quinn. Designing efficient algorithms for parallel
erences: Distributed, adaptive approximation algorithms for computers. McGraw-Hill, 1987.
matching with preference lists. Algorithms, 6(4):824–856, [27] J. L. Träff. A parallel approach to the stable marriage
2013. problem. In Proceedings of the Nordic Operations Research
[9] Graph 500. http://www.graph500.org. Conference (NOAS ’97), pages 277–287, 1997.
[10] D. Gusfield and R. W. Irving. The stable marriage problem. [28] C. White and E. Lu. An improved parallel iterative algo-
The MIT press, 1989. rithm for stable matching. Extended Abstract, Companion of
[11] M. Halappanavar, J. Feo, O. Villa, A. Tumeo, and A. Pothen. IEEE/ACM International Conference for High Performance
Approximate weighted matching on emerging manycore and Computing, Networking, Storage and Analysis (SuperCom-
multithreaded architectures. Int. J. High Perf. Comput. App., puting 2013).
26(4):413–430, 2012. [29] L. B. Wilson. An analysis of the marriage matching assign-
[12] R. W. Irving. An efficient algorithm for the ”stable room- ment algorithm. BIT, 12:569–575, 1972.
mates” problem. J. Algorithms, 6(4):577–595, 1985.
[13] R. W. Irving and S. Scott. The stable fixtures problem -
A many-to-many extension of stable roommates. Discrete
Applied Mathematics, 155(16):2118–2129, 2007.
[14] A. Khan, A. Pothen, M. M. A. Patwary, N. R. Satish, N. Sun-
daram, F. Manne, M. Halappanavar, and P. Dubey. Efficient
approximation algorithms for weighted b-matching. SIAM J.
Sci. Comput.
[15] A. M. Khan, D. F. Gleich, A. Pothen, and M. Halappanavar.
A multithreaded algorithm for network alignment via approx-
imate matching. In SC, page 64, 2012.
[16] V. Knoblauch. Marriage matching: A conjecture
of Donald Knuth. Economics Working Papers,
http://digitalcommons.uconn.edu/econ wpapers/200715,
2007.
[17] D. Manlove. Algorithmics of matching under preferences.
World Scientific, 2013.
[18] F. Manne and R. H. Bisseling. A parallel approximation
algorithm for the weighted maximum matching problem. In
PPAM’08, volume 4967 of LNCS, pages 708–717. Springer,
2008.
[19] F. Manne and M. Halappanavar. New effective multithreaded
matching algorithms. In 2014 IEEE 28th International Par-
allel and Distributed Processing Symposium, Phoenix, AZ,
USA, May 19-23, 2014, pages 519–528, 2014.
[20] J. Mestre. Greedy in approximation algorithms. In Al-
gorithms - ESA 2006, 14th Annual European Symposium,
Zurich, Switzerland, September 11-13, 2006, Proceedings,
volume 4168 of Lecture Notes in Computer Science, pages
528–539. Springer, 2006.
[21] M. Müller-Hannemann and A. Schwartz. Implementing
weighted b-matching algorithms: Insights from a computa-
tional study. J. Exp. Algorithmics, 5, December 2000.
[22] Richard C. Murphy, Kyle B. Wheeler, Brian W. Barrett, and
James A. Ang. Introducing the Graph 500. Cray User’s
Group (CUG), 2010.