Parallel Algorithms for Matching

This master's thesis examines parallel algorithms for matching problems under preferences, specifically stable marriage, b-marriage, and popular matching. It introduces parallel implementations of sequential algorithms for these problems using OpenMP and analyzes their performance on both synthetic and real-world datasets. Experimental results show speedups of up to 16x for stable marriage and 11x for popular matching compared to sequential algorithms.

University of Bergen

Department of Informatics
Algorithms

Parallel algorithms for matching
under preferences

Student: Håkon Heggernes Lerring
Supervisor: Professor Fredrik Manne

Master Thesis
June 2017
Acknowledgement
First of all: I would like to thank Fredrik Manne for supervising me. I could not
have done this without your help. I also want to thank Anna Maria Eilertsen for
her good support and relentlessly positive attitude.
Contents

1 Introduction 5

2 Background 6
2.1 Parallelisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 OpenMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Atomic operations . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.3 Load balancing . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Stable Marriage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Popular Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3 Sequential algorithms 11
3.1 Stable Marriage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.2 The gale-shapley algorithm . . . . . . . . . . . . . . . . . 12
3.1.3 The mcvitie-wilson algorithm . . . . . . . . . . . . . . . 13
3.2 The b-Marriage Problem . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3 Maximum Size Popular Matching . . . . . . . . . . . . . . . . . . . 15
3.3.1 The huang-kavitha algorithm . . . . . . . . . . . . . . . . 15

4 Parallel algorithms 18
4.1 Parallel Stable Marriage . . . . . . . . . . . . . . . . . . . . . . . . 18
4.1.1 A parallel gale-shapley algorithm . . . . . . . . . . . . . 19
4.1.2 A parallel mcvitie-wilson algorithm . . . . . . . . . . . . 20
4.2 Parallel b-Marriage . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.3 Parallel Popular Matching . . . . . . . . . . . . . . . . . . . . . . . 22


5 Experimental results 27
5.1 Test setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.1.1 Xeon Phi . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.2 Instances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.3 Stable marriage results . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.3.1 Solution quality . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.3.2 Easy instances . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.3.3 Hard instances . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.3.4 Real graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.3.5 Xeon Phi performance . . . . . . . . . . . . . . . . . . . . . 37
5.4 b-Marriage results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.5 Popular matching results . . . . . . . . . . . . . . . . . . . . . . . . 42

6 Conclusion 45

Bibliography 46

A Appendix 48
A.1 On Stable Marriages and Greedy Matchings . . . . . . . . . . . . . 48

Chapter 1

Introduction

Every year hundreds of thousands of students express their preferences for higher
education in Norway. Each hope-to-be student lists a couple of schools and
programmes they want to be accepted to. The student orders the schools from
most preferred to least preferred, and submits them to The Norwegian Universities
and Colleges Admission Service (NUCAS). NUCAS is then tasked with the job
of deciding who gets to study where. While the task might seem daunting, it is not
without hope. The problem of college admissions has been studied since 1962, when
Gale and Shapley introduced a solution in "College Admissions and
the Stability of Marriage" [2]. With their solution we end up with a set of accepted
applications and an important promise of stability: if a student is accepted to a
less-than-preferred college, he will not be able to get accepted to a more preferred
school by talking to it directly. Likewise for the school: if it only gets students with
bad grades, it will not be able to admit a better student by talking to him/her
directly.
Solving this problem for large numbers of students and colleges takes time, and
as we all know: time is money. The question we want to answer is then: can
we solve problems like the college admissions problem faster by utilising the multiple
processors of modern CPUs?
Previous attempts at this have been less successful [7], but recent developments
in the closely related field of weighted maximum matching suggest a new way [6][9].
In this thesis we use these ideas to present new parallel algorithms for the stable
marriage problem, the b-Marriage problem and the Popular Matching problem.
Parts of this thesis are published as the paper "On Stable Marriages And Greedy
Matchings" [8] at the SIAM Workshop on Combinatorial Scientific Computing (CSC16),
provided in Appendix A.1. The paper received the Best Paper Award.

Chapter 2

Background

2.1 Parallelisation
To understand the work we have done it is useful to have some basic understanding
of how multithreading works on modern operating systems, and of some of the
tools we used to develop our parallel algorithms. We assume basic knowledge of
programming and a basic intuition of how source code compiles to a sequential list of
instructions (a program).
In regular sequential applications there is usually a single execution path through
the program. To run the program, the operating system spawns a process, allocates
it some time on a processor, and executes it. To ensure that other
processes cannot interfere with the execution of the program, the memory allocated
to it is usually kept private. Modern computers often feature multiple processors, letting
the operating system safely run multiple programs concurrently without two programs
modifying each other's state. Running our program on only one processor
at a time limits us to the execution speed of a single processor.
One way to speed up the execution of our program is to run multiple processes,
each with their own memory space and a part of the problem to solve, without
any communication between the processes. As long as there are no dependencies
between the parts we run in parallel, this is a valid way to speed up the execution.
When the executions of the processes depend on each other, they need to communicate.
Even when possible, inter-process communication can be expensive time-wise, and it
also requires careful planning of how the processes communicate. One option is
to let the programs share memory, which is the basis for multithreaded programming,
letting each program (or thread) make decisions based on a shared view
of the memory. However, this can lead to what is known as a race condition. A


race condition is a situation where the result of some computation depends on the
timing of the execution. We illustrate this in Figure 2.1, where the result of two
threads executing an incrementation of a variable (a++) depends on how the operations
are interleaved. In our example incrementing is implemented as a three-step
operation: 1. read the memory location into a temporary variable; 2. write back the
temporary variable plus one; 3. return the temporary variable.

Thread 1        Thread 2        Value of a
tmp = a                         0
                tmp' = a        0
a = tmp + 1                     1
                a = tmp' + 1    1
return tmp      return tmp'     1

Figure 2.1: Interleaved incrementation of a (time flows downwards)

Each thread first fetches the value of a into a temporary variable, increments the
temporary variable, and then stores it back to a. Even though we expected the
final value of a to be 2, the value each thread writes back is 1.
Allowing each thread to do both operations in one go without the other thread
doing any simultaneous operations would give a correct result, but this requires
some kind of synchronisation between the threads. Synchronisation is the process
where multiple threads, through some kind of communication, agree to pause until
some condition occurs. Synchronisation can reduce the gain we get from running
on multiple processors, by doing extra work or waiting for other threads.
There are two "layers" that provide tools to help us parallelise our code: OpenMP,
an API for developing parallel programs on shared-memory computers, and the
primitives provided by the CPU for doing safe concurrent operations.

2.1.1 OpenMP
OpenMP provides an interface for splitting the execution of a program into mul-
tiple threads. This is done by prepending a block of code with the #pragma omp
parallel preprocessing directive which, when precompiled by a compiler sup-
porting OpenMP, makes the block of code run in parallel on as many threads as
supported by the system. OpenMP also provides a set of easy-to-use functions for


common tasks inside a parallel region, the most notable of which (for our use) are
#pragma omp for, omp_set_lock/omp_unset_lock, and omp_get_wtime.
The #pragma omp for directive takes a well-formed for-loop (no multiple exits or
unpredictable jumps) and splits it so that each thread gets a part of the iterations. It
can be configured with extra clauses, such as how to split the iterations (the schedule
clause), which variables should be private to each thread (the private clause), or
how to combine the results of the iterations through some kind of reduction function
(the reduction clause).
The omp_set_lock and omp_unset_lock functions are used for synchronisation
between the threads. The functions take a shared memory location as a parameter,
which is used as a lock instance. Once a lock is set, it cannot be set again from any
other thread without being unset first. OpenMP also has a more coarse-grained
version of these locks, the #pragma omp critical directive which does not require
the programmer to set up a lock instance, but ensures that only one thread at any
given time will be in the region of code covered by the critical pragma. In our
algorithms we can prevent unnecessary waiting by ensuring that only threads with
overlapping memory reads/writes use the same lock instance. A region of code
protected by a lock is also called a critical section.
Some of the same functionality as OpenMP provides is available through, for
example, the pthreads library (with quite a bit of extra programming), but we chose
OpenMP for its ease of use and our preexisting knowledge of the API.

2.1.2 Atomic operations


An operation (or a set of operations) is atomic if it appears to execute instantaneously
from the point of view of other threads. Even though several threads are executing the
operation, it will never interleave with other operations on the same memory.
Recall our earlier example with the a++ operation. As we saw, the operation is in
reality a set of operations (b = a; a = b + 1; return b;) which may be interleaved
with another thread performing the same operation on a. Many modern
CPUs provide an atomic instruction for exactly this situation, called atomic fetch
and add. The ones we use in our programs are atomic_compare_and_swap and
atomic_fetch_and_add (Figure 2.2).

2.1.3 Load balancing


Load balancing between a set of threads is the process of moving work from a
busy thread to another less busy thread in an attempt to make the amount of


Operation                            Description
atomic_fetch_and_add(*a, b)          tmp = *a; *a = tmp + b; return tmp;
atomic_compare_and_swap(*a, b, c)    if (*a != b) { return false; }
                                     *a = c; return true;

Figure 2.2: Relevant atomic operations

work left more even. Uneven load balance is a situation where some thread
of the program continues to execute tasks while other threads idle. This can
happen because the problem instance is divided between the threads in a way that
leaves some threads with more work than others. This is a waste of precious CPU
time, as some processors may go idle while there is still work to do. Knowing a good
partitioning of the problem instance beforehand can be computationally hard
and require too much time to compute, and as a result load balancing is often done
during runtime.

2.2 Matching

In this section we give an introduction to the basic concepts and terminology of
matching. We assume the reader has some knowledge of graphs.
Given a graph G = (V, E), a matching M is a subset of edges M ⊆ E such that
no two edges in M share a common endpoint. A matching problem often has some
additional condition attached to it. A maximum weight matching M is a matching
in a weighted graph (where each edge has some weight associated with it, typically
an integer or real number) such that the sum of the weights of the edges in M is
the greatest over all matchings in the graph. A maximum cardinality matching is a
matching M in G such that no other matching has a size (the number of edges)
larger than M.


2.3 Stable Marriage


The stable marriage problem is defined as follows: given a set L of n men, a set
R of m women, how the men rank the women, and how the women rank the men,
can we find a matching between the men and the women such that the matching
is stable? A matching is stable if it contains no blocking pair. A pair (l, r), where
l ∈ L and r ∈ R, is a blocking pair if it is not in the matching while both l and r
prefer each other over their current match.
There are many variants of the problem, but we only look at two: the stable
marriage problem (SM) and the stable marriage with incomplete lists (SMI). The
difference between SM and SMI is that in SM each man ranks every woman and
vice versa, while this need not be true for an instance of SMI. We call instances of
these problems complete and sparse, respectively.
The problem of greedy matching has been shown to reduce to an instance of
stable marriage [8].

2.4 Popular Matching


In the problem of maximum size popular matching we, instead of trying to find a
stable matching, try to maximise the number of men and women who prefer the
matching over any other possible matching. More formally, we say that a matching
M′ is more popular than M if there are more men and women that prefer M′ to M
than the other way around. A matching M is popular if there is no matching
M′ that is more popular than it [4]. Note that a maximum size popular matching
might not be stable, but complete stable matchings (where every vertex of the
graph is matched to some other vertex) are popular. This means we allow our
matching to become unstable while increasing the matching size.

Chapter 3

Sequential algorithms

In this chapter we present the background material necessary to understand our
parallel algorithms presented in Chapter 4. While the sequential algorithms presented
in this chapter are not our work, the implementations of them are.

3.1 Stable Marriage


In this section we present two existing O(N²) algorithms for solving the problem:
the original algorithm by Gale and Shapley, and a recursive variant by McVitie
and Wilson.

3.1.1 Introduction
The problem of college admissions was first introduced by Gale and Shapley with
their algorithm for solving the problem of admitting students to colleges. In [2],
they describe the familiar situation where the students on one side want to be
accepted by the best college (based on some personal ranking), while the colleges
on the other side want to admit their best ranked students (by grades or other
measures).
Gale and Shapley gave their solution (which we call the gale-shapley algorithm)
to another problem: the stable marriage problem. In the stable marriage
problem each side has a quota of 1 to fill. Instead of students and colleges there
are men and women ready to marry. McVitie and Wilson later expanded on the
gale-shapley algorithm [10] by giving a recursive version we call the mcvitie-
wilson algorithm.


3.1.2 The gale-shapley algorithm


The gale-shapley algorithm is based on simulating the process of finding a
matching, with one side proposing and the other side rejecting or accepting proposals
based on their preferences. The algorithm operates in rounds. In each round
it picks men from a list of unmatched men. Each of these men proposes to his most
preferred woman among those he has yet to propose to. Each woman then rejects
all but her current best offer. The men with rejected proposals are put back on
the list of unmatched men. We obtain our stable matching when, at the end of a
round, all the men either have pending proposals or are out of women to propose
to. In Algorithm 1 we give pseudo-code for the gale-shapley algorithm.
In Algorithm 1 we define ranking(r, l) to return r's ranking of l as a number
from 1 to n, with 1 being the best rank and n the worst. ranking(r, NULL)
returns n + 1 to ensure that any partner is better than no partner. nextCandidate(l)
returns l's sequence of ranked partners one by one with each successive call, starting
with the best ranked partner. The variable suitor(r), which represents the current
best proposer to r, is initially defined as NULL. Q is a queue that supports the
operations enqueue(l), to add an element from L to the queue, and dequeue(), to
remove and return the first element of the queue. The algorithm differs slightly
from the original one by putting rejected men onto a queue, removing the need
to scan through L in each round.

Algorithm 1 The gale-shapley algorithm


Q = L
while Q ≠ ∅ do
    u = Q.dequeue()
    partner = nextCandidate(u)
    while partner ≠ NULL ∧
            ranking(partner, u) > ranking(partner, suitor(partner)) do
        partner = nextCandidate(u)
    end while
    if partner ≠ NULL then
        if suitor(partner) ≠ NULL then
            Q.enqueue(suitor(partner))
        end if
        suitor(partner) = u
    end if
end while


Algorithm 1 terminates when there are no more elements on the queue, at which
point the set of edges {(suitor(r), r) | r ∈ R} is our matching. In SM instances,
where each man ranks each woman and vice versa, it has been shown that Algorithm 1
always produces a complete stable solution [2]. This is not necessarily true for
SMI instances: the resulting matching will still be stable, but not all men and
women might get a match. To support SMI instances we modify the
nextCandidate(l) routine to return NULL after the last candidate.

3.1.3 The mcvitie-wilson algorithm


The algorithm proposed by McVitie and Wilson [10] is based on the gale-shapley
algorithm, together with the observation that the order in which the suitors propose
does not change the set of matched vertices [3]. The algorithm starts out with an
empty set of men, adding men to the solution while keeping the solution stable.
The algorithm proceeds through the set of men, adding one at a time to the
solution by letting him propose to the best ranked woman he has yet to
propose to. As soon as a woman has more than one proposal the solution is no
longer stable. In this case the woman rejects the worst ranked man. The rejected
suitor is recursively re-added to the solution by letting him propose to his
next best choice. The recursion terminates when a man proposes to a
woman with no previous proposals. At this point the algorithm continues with the
next unmatched man. In Algorithm 2 we present a pseudo-code implementation
of the mcvitie-wilson algorithm.
Algorithm 2 terminates when there are no more members of L to loop over and
the call to introduceSuitor returns, at which point the set of edges {(suitor(r), r) | r ∈
R} is our stable matching. It has been shown that the mcvitie-wilson
algorithm produces the same result as the gale-shapley algorithm [10].

3.2 The b-Marriage Problem


In this section we introduce the b-marriage problem and present an O(N²) algorithm
for solving it.

3.2.1 Introduction
The b-marriage problem is similar to the stable marriage problem described
in Section 2.3, the difference being that each vertex can be matched with a
specified number of neighbours. This is closer to the original college

Algorithm 2 The mcvitie-wilson algorithm


procedure mcvitie-wilson
    for each l in L do
        introduceSuitor(l)
    end for
end procedure

procedure introduceSuitor(u)
    partner = nextCandidate(u)
    while partner ≠ NULL ∧
            ranking(partner, u) > ranking(partner, suitor(partner)) do
        partner = nextCandidate(u)
    end while
    if partner ≠ NULL then
        rejected = suitor(partner)
        suitor(partner) = u
        if rejected ≠ NULL then
            introduceSuitor(rejected)
        end if
    end if
end procedure

admissions problem, where one college can be assigned multiple students. For the
college admissions problem, b is the number of students the college can admit.
For the students, on the other hand, b is fixed to 1, as one student should not get
admitted to multiple colleges simultaneously. The case where only one side of
the problem has a b value greater than 1 can be solved with an algorithm for
the stable marriage problem by duplicating each college vertex b times (along with
any incident edges). If both sides have b > 1 then the copying technique, applied to
both sides, can give a situation where one student is admitted to the same college
multiple times. Instead of dealing with colleges and students, we continue with
men and women, imagining a situation where we have to find a set of polygamous
stable marriages. Each man and woman can have an individual "quota" on how many
of the other sex he/she wants to be matched with. In our experiments the quotas
for the men and women will be uniform, but individual quotas do not change
the problem. We denote the number of wives for man l as bm(l) and the number
of husbands for woman r as bw(r). We only look at the case where both bm(l) and
bw(r) are > 1 for all (l, r) ∈ L × R.


The algorithm we present is based on mcvitie-wilson, but draws inspiration from
the b-suitor algorithm [9], which is an approximation algorithm for weighted
b-matching.
It has been shown that the problem of greedy weighted b-matching, a more
general matching problem on weighted graphs, can be reduced to the b-marriage
problem [A.1].

3.2.2 Algorithm
In both the gale-shapley (Section 3.1.2) and the mcvitie-wilson (Section 3.1.3)
algorithms, a woman with more than one proposal rejects the worst ranked one.
The algorithm for solving b-marriage is similar to these algorithms, but allows a
woman w to have up to bw(w) proposals before she has to reject one. This is done
by introducing a variable pq(w) which holds a max-priority-queue (keyed on the
ranks of w) capable of storing bw(w) + 1 suitors for woman w. Initially all
priority queues are empty. When a man proposes to a woman w the proposal is
added to the queue pq(w). If the size of w's queue exceeds bw(w), the worst ranked
suitor is rejected (by a dequeue operation) and is recursively re-added to the solution,
trying his next candidate. Observe that matching a man to bm women is the same
as matching bm copies of the man, each with their own woman, as long as two copies
of the same man are not matched to the same woman. We use this observation
in the algorithm and introduce each man bm times to the solution. Because the
copies of a man share the nextCandidate(l) pointer, which advances with each
introduction, we avoid matching two copies of a man to the same woman.
The algorithm terminates when all the copies of the men have been introduced
to the solution and the recursive calls have returned. At this point the contents of
the queues hold the edges of the stable matching: each woman w is matched
to each suitor in her queue pq(w).
In Algorithm 3 we give a pseudo-code implementation of this algorithm.

3.3 Maximum Size Popular Matching


In this section we present an algorithm by Huang and Kavitha [4] with an O(|E| ·
|N0|) runtime, where |N0| = min(|L|, |R|) and |E| is the total number of rankings.

3.3.1 The huang-kavitha algorithm


Recall the introduction to the problem in Section 2.4.


Algorithm 3 The b-marriage algorithm


procedure b-marriage
    for each l in L do
        for c = 1 to bm(l) do
            introduceSuitor(l)
        end for
    end for
end procedure

procedure introduceSuitor(u)
    partner = nextCandidate(u)
    while partner ≠ NULL ∧
            ranking(partner, u) > ranking(partner, pq(partner).peek()) do
        partner = nextCandidate(u)
    end while
    if partner ≠ NULL then
        pq(partner).enqueue(u)
        if |pq(partner)| > bw(partner) then
            introduceSuitor(pq(partner).dequeue())
        end if
    end if
end procedure

The huang-kavitha algorithm works by first computing a stable matching
(we can find this with an algorithm solving the stable marriage problem, e.g. the
mcvitie-wilson algorithm). Any man rejected by all his ranked women is immediately
"lifted" to a higher level and re-introduced to the solution. Men at the higher
level are always preferred by women over men at the lower level. If a man at the higher
level is rejected by all his ranked women he remains unmatched. The algorithm
terminates when all men have either been matched or been rejected by all their
ranked women at the higher level. The matching returned is
then a maximum size popular matching.
We base our algorithm on the mcvitie-wilson algorithm presented in Section 3.1.3
and present the pseudo-code in Algorithm 4.
In Algorithm 4 we introduce the variable level(l), which stores the current level
of a man l. level(l) is initialised to 0 for every man. We also introduce the
operation resetNextCandidate(l), which resets the sequence of candidates returned
by nextCandidate(l) back to the beginning of the sequence. We define level(NULL) = 0
and, as before, ranking(r, NULL) = n + 1, so that a woman with no suitor accepts
any proposal. The peek() operation


on a queue returns the first element from the queue without removing it.

Algorithm 4 The huang-kavitha algorithm


procedure huang-kavitha
    for each l in L do
        introduceSuitor(l)
    end for
end procedure

procedure introduceSuitor(u)
    repeat
        partner = nextCandidate(u)
        levelu = level(u)
        if partner ≠ NULL then
            levelp = level(suitor(partner))
            ranku = ranking(partner, u)
            rankp = ranking(partner, suitor(partner))
        end if
    until partner == NULL ∨ levelu > levelp ∨ (levelu == levelp ∧ ranku < rankp)
    if partner ≠ NULL then
        rejected = suitor(partner)
        suitor(partner) = u
        if rejected ≠ NULL then
            introduceSuitor(rejected)
        end if
    else if level(u) == 0 then
        level(u) = level(u) + 1
        resetNextCandidate(u)
        introduceSuitor(u)
    end if
end procedure

Chapter 4

Parallel algorithms

In this chapter we introduce our main contributions: a set of parallel algorithms
for solving the stable marriage problem, the b-marriage problem, and the popular
matching problem.

4.1 Parallel Stable Marriage


In this section we present two parallel algorithms for solving the stable marriage
problem: a parallel version of the gale-shapley algorithm presented in Algorithm 1,
and a parallel version of the mcvitie-wilson algorithm presented in
Algorithm 2.
There are previous parallel algorithms for solving the stable marriage problem.
One algorithm, by Hull [5], models the men and women as agents, with the men-
agents sending proposal messages to the women and the women-agents accepting
and rejecting based on their preferences. Tseng and Lee proposed an algorithm [12]
based on the divide-and-conquer approach, where partial solutions are computed for
subsets of the men and merged together by resolving the cases of conflict (two men
in different subsets can be matched with the same woman). In this approach each
subset can be solved in parallel without any synchronisation; however, the merging
would require synchronisation to be done in parallel. Larsen [7] presented two
parallel algorithms for a distributed memory computer: one based on the gale-shapley
algorithm and one based on the algorithm by Tseng and Lee.
More recent advances in parallel algorithms for the weighted maximum matching
problem, like the suitor algorithm [9] by Manne and Bisseling, bear a strong
resemblance to the mcvitie-wilson and gale-shapley algorithms when applied
to general edge-weighted graphs [A.1]. This leads us to believe we can use some


of the same techniques for parallelising the mcvitie-wilson and gale-shapley
algorithms as were used for the parallel suitor algorithm.

4.1.1 A parallel gale-shapley algorithm


We now present our parallel gale-shapley algorithm and explain some design
decisions. We base the parallel algorithm on the sequential gale-shapley algorithm
as presented in Section 3.1.
In the sequential algorithm the list of men is processed in sequence. As noted
in Section 3.1.3, the order in which the men are processed does not affect the outcome.
This lets us split L into partitions of size |L|/p, where p is the number of threads. We
then execute a modified gale-shapley algorithm on these partitions in parallel,
sharing the suitor(r) variables between the threads. The modified gale-shapley
algorithm has to take into account that proposals can be checked in
parallel. Observe that the checking can be done in parallel, but the update has to be
executed atomically: we can check if a man is a better suitor, but exchanging the
value of suitor(r) has to be done without anyone changing the value in the meantime.
Otherwise we might end up with a situation where a proposal is rejected
twice and one proposal is lost, as exemplified in Figure 4.1. In Figure 4.1 both
thread 1 and thread 2 are proposing simultaneously to woman a: on Thread 1
man x wants to propose to a, and on Thread 2 man y wants to propose to a. Both
threads check simultaneously whether their man is better ranked than z, a's current
suitor. Both threads see their man as better and update the value of suitor(a).
Observe that this also enqueues a copy of z to each thread's queue.
Thread 1                    Thread 2                    suitor(a)
rejected = suitor(a)                                    z
                            rejected = suitor(a)        z
if (ranking(a,x) <
    ranking(a,rejected))                                z
                            if (ranking(a,y) <
                                ranking(a,rejected))    z
Q.enqueue(rejected)                                     z
                            Q.enqueue(rejected)         z
suitor(a) = x                                           x
                            suitor(a) = y               y

Figure 4.1: Example of an interleaved proposal (time flows downwards)

One way to mitigate this issue is to take a lock before updating, one lock per

woman to ensure that each thread gets exclusive access to this critical section.
As noted in Section 2.1, once a thread has obtained a lock, no other thread can obtain the same lock before it is unlocked. An alternative method is to use an atomic compare and swap on the value of suitor. An atomic compare and swap fails if another thread has already changed the value from that of rejected, letting the thread fetch the new value of suitor(a) and retry the operation.
We opted for a setup where each thread has its own queue of suitors. The alternative is one shared queue with lock-protected access, but this creates contention as the thread count grows. Our approach reduces the sequential portion of the program by not having to lock a shared queue. The downside is that the lengths of the queues might become unbalanced, without any ability to share work between threads. Note that as the algorithm progresses the men will move between queues.
In Algorithm 5 we present our pseudo-code for the parallel gale-shapley algorithm. In Algorithm 5 the parfor operation splits the men in L into partitions of size |L|/p, where p is the number of threads, iterating over the different partitions in different threads. The men are added to the queue Q, which contains the rejected or not yet matched men. One important invariant kept is that a man cannot exist in more than one queue at any time. This invariant is useful as we can progress the nextCandidate(u) sequence from one thread without worrying about other threads interfering. Once a potential partner for man u has been found, the current suitor of the partner is swapped with u using the CAS(*a, b, c) operation. The CAS(*a, b, c) operation is the compare-and-swap operation as presented in Section 2.1.2, which will only update (atomically) the suitor(partner) value if it still equals opponent. If the value of suitor(partner) was updated since we read it, the operation will fail, as the CAS operation returns another opponent than the one we previously compared our current suitor against. The current man u is then retried against the new opponent. If he cannot beat the new opponent we put the man back on the queue to be retried against a new partner.
The algorithm terminates when all the queues are empty.

4.1.2 A parallel mcvitie-wilson algorithm


We will now present our parallel mcvitie-wilson algorithm. The parallel algorithm is based on the sequential mcvitie-wilson algorithm as presented in Section 3.1.3, but borrows much of its inspiration from the parallel gale-shapley algorithm and the parallel suitor algorithm [9].
As with the parallel gale-shapley algorithm, the set L containing the men is partitioned and each partition is processed in parallel by a modified introduceSuitor procedure. The same observations hold for the parallel mcvitie-wilson algorithm as for the parallel gale-shapley algorithm: we can check if u is a prospective partner for a woman in parallel, but the rejection has to be done atomically. The modified introduceSuitor procedure uses compare and swap to atomically replace the value of suitor(partner) if no other thread has modified it since the decision to replace was made. If a change was made to the suitor(partner) variable, the algorithm immediately retries u against the woman's new suitor. If the new opponent is better ranked than u, the man is considered rejected and is introduced into the solution again, starting at his next candidate.
A notable difference from the gale-shapley algorithm is, as noted in Section 3.1.3, that the algorithm is done when the last thread has finished processing its part of the problem. This property allows the parallel mcvitie-wilson algorithm to allocate parts of L to threads on the go as they finish their work, preventing load imbalance.
In Algorithm 6 we present pseudo-code for the algorithm.

4.2 Parallel b-Marriage


In this section we introduce a parallel algorithm for the b-Marriage problem, and give pseudo-code for it. The parallel algorithm is based on the sequential b-marriage algorithm presented in Section 3.2, but draws inspiration from the parallel b-suitor algorithm by Khan et al. [6].
In the parallel version of the algorithm, similarly to the parallel mcvitie-wilson and gale-shapley algorithms, the set of men is split among the threads. The men are then processed by a modified b-marriage algorithm.
Recall the sequential b-marriage algorithm in Algorithm 3 and observe that the nextCandidate sequence is shared between the b copies of the men. In the sequential version the implementation of nextCandidate is not crucial, but in a multithreaded setting there might be multiple processors trying to advance this pointer at the same time. This is because different copies of the same man can be the current man of multiple threads concurrently. This can lead to inconsistent results, like a man being matched multiple times with the same woman, if the counter is incremented as shown in Figure 2.1. Using atomic operations, like atomic fetch and add, we can safely increment the counter. Atomic fetch and add gives each thread a unique candidate at a small overhead relative to a simple write [11].
The second part that requires synchronisation is when we check whether or not a man will be considered for a woman. Here we implemented the check-lock-recheck method, where we first check if a man can win against the current opponent (the worst ranked man that has proposed), then lock and check if that man still is a better suitor for the partner, and add the man to the priority queue. If the man no longer is a better suitor he is considered rejected and is immediately recursively retried with his next partner. The check before locking lets the algorithm avoid locking in cases where there is no chance of the application being accepted by the college. Using a thread-safe priority-queue implementation would avoid locking here, but we do not want our algorithm to depend on the safety of the queue implementation.
In Algorithm 7 we give our implementation of the parallel b-marriage algorithm. In this implementation we show a new, thread-safe implementation of nextCandidate(u). To implement nextCandidate we introduce a new variable candidate(u), which we can atomically fetch and add (FAA) (see Section 2.1.2).

4.3 Parallel Popular Matching


In this section we present our parallel algorithm for popular matching based on the algorithm presented in Section 3.3.1, and present an implementation of the algorithm based on Algorithm 4.
The parallel huang-kavitha algorithm works much in the same way as the parallel mcvitie-wilson algorithm. Recall that in the sequential algorithm (Algorithm 4) we compare both the level and the ranking of our current man with those of the potential partner's current suitor. In the parallel algorithm, between the time we check if u is a better suitor and the time we compare and swap the value of suitor(partner), the suitor may have leveled up. In this case the current suitor should win against our prospective suitor. To solve this we change the suitor(l) variable to be a tuple of the suitor id and his level, and do the compare and swap on the whole tuple. This ensures that the compare and swap fails if the level of the suitor has changed. As with the parallel mcvitie-wilson algorithm, in the case where the CAS operation returns another opponent than the one we previously compared our current suitor against, we have to retry against the new opponent. If he cannot beat the new opponent we immediately reintroduce the man to the solution.


Algorithm 5 The parallel gale-shapley algorithm


#omp parallel do
    Q = ∅
    parfor each l in L do
        Q.enqueue(l)
    end parfor
    while Q ≠ ∅ do
        u = Q.dequeue()
        partner = nextCandidate(u)
        opponent = suitor(partner)
        while partner ≠ NULL ∧ ranking(partner, u) > ranking(partner, opponent) do
            partner = nextCandidate(u)
            opponent = suitor(partner)
        end while
        while partner ≠ NULL do
            if ranking(partner, u) < ranking(partner, opponent) then
                if CAS(suitor(partner), opponent, u) == opponent then
                    Q.enqueue(opponent)
                    partner = NULL
                else
                    opponent = suitor(partner)
                end if
            else
                Q.enqueue(u)
                partner = NULL
            end if
        end while
    end while
end parallel


Algorithm 6 The parallel mcvitie-wilson algorithm


procedure parallel-mcvitie-wilson
    #omp parallel do
    parfor each l in L do
        introduceSuitor(l)
    end parfor
    end parallel
end procedure

procedure introduceSuitor(u)
    partner = nextCandidate(u)
    opponent = suitor(partner)
    while partner ≠ NULL ∧ ranking(partner, u) > ranking(partner, opponent) do
        partner = nextCandidate(u)
        opponent = suitor(partner)
    end while
    while partner ≠ NULL do
        if ranking(partner, u) < ranking(partner, opponent) then
            if CAS(suitor(partner), opponent, u) == opponent then
                introduceSuitor(opponent)
                partner = NULL
            else
                opponent = suitor(partner)
            end if
        else
            introduceSuitor(u)
            partner = NULL
        end if
    end while
end procedure


Algorithm 7 A parallel b-marriage algorithm


procedure parallel-b-marriage
    #omp parallel do
    parfor each l in L do
        for c = 1 to bm(l) do
            introduceSuitor(l)
        end for
    end parfor
    end parallel
end procedure

procedure nextCandidate(u)
    return FAA(candidate(u), 1)
end procedure

procedure introduceSuitor(u)
    partner = nextCandidate(u)
    while partner ≠ NULL ∧ ranking(partner, u) > ranking(partner, pq(partner).peek()) do
        partner = nextCandidate(u)
    end while
    if partner ≠ NULL then
        lock(partner)
        if ranking(partner, u) < ranking(partner, pq(partner).peek()) then
            pq(partner).enqueue(u)
            unlock(partner)
            if |pq(partner)| > bw(partner) then
                introduceSuitor(pq(partner).dequeue())
            end if
        else
            unlock(partner)
            introduceSuitor(u)
        end if
    end if
end procedure


Algorithm 8 The parallel huang-kavitha algorithm


procedure parallel-huang-kavitha
    #omp parallel do
    parfor each l in L do
        introduceSuitor(l)
    end parfor
    end parallel
end procedure

procedure introduceSuitor(u)
    repeat
        partner = nextCandidate(u)
        level_u = level(u)
        if partner ≠ NULL then
            rank_u = ranking(partner, u)
            (opponent, level_o) = suitor(partner)
            rank_o = ranking(partner, opponent)
        end if
    until partner == NULL ∨ level_u > level_o ∨ (level_u == level_o ∧ rank_u < rank_o)
    if partner == NULL then
        level(u) = level(u) + 1
        resetNextCandidate(u)
        introduceSuitor(u)
    else
        while partner ≠ NULL do
            if ranking(partner, u) < ranking(partner, opponent) then
                if CAS(suitor(partner), (opponent, level_o), (u, level_u)) == (opponent, level_o) then
                    introduceSuitor(opponent)
                    partner = NULL
                else
                    (opponent, level_o) = suitor(partner)
                    rank_o = ranking(partner, opponent)
                end if
            else
                introduceSuitor(u)
                partner = NULL
            end if
        end while
    end if
end procedure
Chapter 5

Experimental results

In this chapter we present the runtime results of the parallel algorithms presented in Chapter 4.

5.1 Test setup


Our test setup is the computer Lyng with two Xeon E5-2699 v3 CPUs running at 2.30GHz and 252GB of RAM. Each of the CPUs has 18 cores. Connected to the machine is a Xeon Phi 7120 “Knights Corner” coprocessor running at 1.23GHz with 16GB of GDDR5 RAM and 61 active cores, one of which is reserved by the management OS running on it. The Xeon Phi cores have 4 hardware threads each, totaling 244 hardware threads (240 available to OpenMP, 4 being reserved by the coprocessor operating system), while the Xeon E5 has 2 threads per core, totaling 72 threads. For some experiments we used a second system, “Brake”, with 240GB RAM and 40 cores. The specs are also listed in Figure 5.1. The machines are shared-memory computers, meaning all the CPU cores have direct access to all the memory.
All programs were written in C++ and compiled with the Intel compiler icpc (ICC) 17.0.3 20170404 with -O3 optimisation enabled. We had gcc available, but it was not used because of its lacking support for offloading to the Xeon Phi.

Name      CPU                             L1 cache   L2 cache   L3 cache
Brake     4x Xeon E7-4850 @ 2.00GHz       10x32KB    10x256KB   24MB
Lyng      2x Xeon E5-2699 v3 @ 2.30GHz    18x32KB    18x256KB   45MB
Xeon Phi  Xeon Phi 7120 @ 1.24GHz         61x32KB    61x512KB   None

Figure 5.1: Machine CPU specifications


For each of the algorithms we first run the sequential version, where all OpenMP/parallelisation-related code has been removed. The sequential version is optimised for single-thread speed and does not have to guard against race conditions with other threads. The run time measured for the best sequential algorithm is used as the baseline when calculating the speedup gained from parallelisation.
The run times of the parallel versions are measured using the OpenMP built-in omp_get_wtime function. The timing includes the allocation and initialisation of memory for use in the algorithm, but not the loading of input data.

5.1.1 Xeon Phi


The Xeon Phi coprocessor is a stripped-down computer with its own processors and memory. It is connected to a host computer via the PCIe bus, allowing for high-bandwidth asynchronous transfer of data between the host and the Phi. The bandwidth of the PCIe bus (roughly 16GB/s) is significantly lower (> 22x slower) than the on-card memory bandwidth (352GB/s), so using the host memory directly is out of the question, as the PCIe bus would quickly become a bottleneck. The MIC architecture of the Xeon Phi builds upon a high number of stripped-down, clocked-down 64-bit x86 cores with a high-speed interconnect. The CPU cores are optimised for parallel numerical applications with their 4 hardware threads and extra-wide vector instructions.
There are two modes of using the Xeon Phi:

• Native: compiling the program for the MIC architecture, logging in to the Linux OS running on the card and executing the program there.

• Offload: using OpenMP pragmas like offload, which transfer the given function and any dependent data via the PCIe bus onto the card, resuming execution on the card.

We primarily used the offload mode, because the native mode could make the card unstable. This would happen if the program ran out of memory, in which case the card would hang until rebooted.

5.2 Instances
To evaluate the effectiveness of our parallelisation we run our algorithms on a combination of random graphs and real graphs. We constructed two different types of random graphs: an “easy” random graph with a relatively low number of randomly

Table 5.1: Easy instances

Problem   n           Edges        Min. degree   Max. degree
10M       10000000    340007576    log(n)        2log(n)
20M       20000000    709985096    log(n)        2log(n)
40M       40000000    1480074907   log(n)        2log(n)
50M       50000000    1849916651   log(n)        2log(n)
60M       60000000    2220111233   log(n)        2log(n)
75M       75000000    2887538634   log(n)        2log(n)
100M      100000000   3850014406   log(n)        2log(n)
125M      125000000   4812489007   log(n)        2log(n)

Table 5.2: Hard instances


Problem n Edges Min. degree Max. degree
100K 100K 10B 100K − 1 100K − 1
200K 200K 40B 200K − 1 200K − 1
300K 300K 90B 300K − 1 300K − 1
400K 400K 160B 400K − 1 400K − 1
500K 500K 250B 500K − 1 500K − 1

distributed edges (between log(n) and 2·log(n)) per vertex, and a “hard” instance where each side has a common ranking and |M|·|W| edges.
The easy instances were created by assigning a random number of edges, between log(n) and 2log(n), to each man. For each man the ranks were then shuffled. These random graphs are considered easy because the average number of proposals needed to reach a stable matching is < 5. Table 5.1 shows the easy graphs we generated. The 125M instance was the largest one we could fit in RAM on Lyng without resorting to swapping data out and in from disk. Because of the restrictive memory limit on the Xeon Phi, the largest graph we could fit there was 20M vertices.
What we call a hard instance is the case where all the men share the same preference for the women, meaning they rank them equally, and the women do the same for the men. We consider this hard because in any stable matching on this instance type only one suitor gets his best ranked woman, only one gets his second best ranked, and so on. This means our algorithms consider 1 + 2 + 3 + ... + (n−1) edges, summing up to (n−1)·n/2 edges. Table 5.2 shows the hard instances we generated.
The real world graphs we use are listed in Table 5.3. The graphs were retrieved from the SuiteSparse Matrix Collection (formerly known as the University of Florida Sparse Matrix Collection) [1]. For these problems we ran a reduction as described in the paper by Manne, Naim, Lerring, and Halappanavar (Appendix A.1), where it is shown that the problem of Greedy Matching can be reduced to an instance of stable marriage. The reduction requires each side of the bipartite graph to contain all the vertices. This doubles the number of vertices, but for the stable marriage problem we let n be the size of either side. Because the original graph is undirected and the stable marriage instance is directed, the number of edges is multiplied by 4. It is worth noting that the resulting stable matching is also a greedy matching in the original graph.

Table 5.3: Real world problems

Problem            n         Edges       Avg. degree   Max. degree
Fault_639          638802    58506732    44            1268
kron_g500-logn21   2097152   364168040   88            855616
Reuters911         13332     592152      48            9060
We ran both of our sequential versions of the algorithms on the datasets, measured which one was fastest, and compared the runtime of the fastest against each of the parallel algorithms. This gives us the speedup gained by utilising multiple processors. We ran each algorithm/instance-size pair 3 times and took the median runtime, to avoid any external factors, like other jobs running on the machine, disturbing the timing. The parallel runs were done on the Lyng machine at UiB, as described in Section 5.1, with up to 72 threads on the host machine and up to 240 threads on the Xeon Phi.

5.3 Stable marriage results


In this section we present our experimental results. For each algorithm we first present the sequential benchmarks, then the parallel runtimes, speedup and efficiency. For brevity's sake we present only a smaller subset of the instances, noting that the rest of the results tell a similar story.

5.3.1 Solution quality


As noted in Section 2.3, a stable marriage in an SMI instance may not be a complete matching. The total solution quality for the easy instances is quite good, as can be seen in Figure 5.2, where the number of men and women with a match is 98% ± 0.2%. Note that any algorithm solving the stable marriage problem with incomplete preference lists obtains the same solution size on these instances [3].


Figure 5.2: Easy instances solution quality (% matched)

Figure 5.3: Easy instances average men's rank

The solution quality is expected, as the average final rank in the man-optimal solution is < 5, as shown in Figure 5.3.

5.3.2 Easy instances


In Figure 5.4 we see the sequential runtime of the two variants of the algorithm, with the gale-shapley algorithm being a few percent faster on every instance. We believe this is because the gale-shapley algorithm starts out with the first queue in the order the suitors are stored in memory, leading to good cache hit rates on the arrays storing information about the men, like the nextCandidate. In our C++ implementation each man takes 12 bytes: 8 bytes for a lookup table and 4 bytes for the next candidate rank. This allows for a prefetch of 8 values of the lookup table and 16 values of the next candidate rank per cache line.

Figure 5.4: Easy instances sequential runtimes on Lyng (less is better)

Figure 5.5: Easy instances runtime on Lyng (less is better)
After the first queue is consumed the algorithm starts processing the “rejects” of the previous round. These men are now in random order. The mcvitie-wilson algorithm does not obtain as good caching of these two arrays, as it jumps around in memory from the start, stabilising the solution before letting a new suitor propose. Our timings show that the gale-shapley algorithm spends around 33% of its time in this first memory-sequential phase.
Figure 5.5 shows the runtimes of the parallel gale-shapley algorithm. Here we see that the trend of the gale-shapley algorithm getting better runtimes also extends to the parallel algorithm.
Figure 5.6 shows the speedup gained with our parallel algorithms. We get a speedup of between 16x and 20x with the parallel gale-shapley algorithm compared to the sequential gale-shapley algorithm. The parallel mcvitie-wilson algorithm does a little worse, with between 14x and 19.9x speedup compared to the sequential gale-shapley algorithm. The dip in speedup from 27 to 36 threads could be due to another task on the system using one thread. This is further substantiated by our results on the older system “Brake”, which did not display this dip in performance. On 72 threads there are multiple threads per processor, which lessens the impact of one slower thread. We believe this difference in speedup, with the gale-shapley algorithm getting the better speedup, is due to the additional cache available with each extra CPU core.

Figure 5.6: Easy instances speedup (T(1)/T(p), more is better)

5.3.3 Hard instances


On the hard instances, even though the instances are greatly reduced in memory footprint, the time required to solve them is greatly increased, as shown by the difference between Figures 5.4 and 5.7. Here we see the largest instances taking over 50 minutes to complete with the sequential algorithm. This is because the average number of edges we have to process has grown from < 5 to (n−1)/2. The memory required to store these instances is now reduced to 1.6–4.6MB, an amount that is cacheable even with the random access patterns. We can see the effect of this because now the mcvitie-wilson algorithm is the faster one and gale-shapley the slower one. We believe this is due to the extra queue operations, which now outweigh the cache gains from the sequential memory access in the first part, that phase now being less than 0.3% of the total runtime of the sequential version.

Figure 5.8: Parallel runtime on hard instances (less is better)

Our parallel algorithms perform just as well on the hard instances. The parallel mcvitie-wilson algorithm traverses 744M edges per second at 72 threads, reducing the runtime from 47 minutes to 2.75 minutes. The gale-shapley variant gets gradually worse relative to the parallel mcvitie-wilson, likely because of the O(n²) number of queue reads and writes it does to keep track of the unmatched men; it traverses 574M edges per second at 72 threads, going from 53 minutes to 3.5 minutes.

Figure 5.7: Hard instances sequential runtimes on Lyng (less is better)

Figure 5.9: Speedup on hard instances (more is better)

Figure 5.10: Efficiency on easy and hard instances (more is better)

As we can see from Figure 5.10, the efficiency of the parallel gale-shapley and parallel mcvitie-wilson algorithms is quite similar. The efficiency of gale-shapley is a little more divergent at the higher thread counts, which is explained by the slight load imbalance that may occur.

5.3.4 Real graphs

We will now present our results on the graphs obtained from the SuiteSparse Matrix
Collection.

Figure 5.11: Real instances timing (less is better)

As previously noted, these graphs are preprocessed into a stable marriage instance, which includes sorting and expanding the graph, but the time for this is not included in our timing. Because of this the times are not directly comparable to other Greedy Matching results. As Figure 5.11 shows, the time it takes to obtain a stable matching in these graphs is quite small (all instances were solved in under a second sequentially), and the gain from parallelising is diminishing. Regardless, we get a speedup of 4.5 (Figure 5.12) on the kron_g500-logn21 graph. The time it took to solve the Reuters911 graph was below the resolution of our measurement.


Figure 5.12: Real instances speedup (more is better)

5.3.5 Xeon Phi performance


In this section we present our results from running the gale-shapley and mcvitie-wilson algorithms on the Xeon Phi.
In Figure 5.13 and Figure 5.14 we compare the best runtime on the Xeon Phi to the best runtime on Lyng. With only one exception (mcvitie-wilson on the easy instance with 10M men/women), the Phi performed significantly worse than Lyng. The per-thread reduction in performance is expected, since the clock speed of the Xeon Phi is roughly 50% that of Lyng (the raw speed of 144 threads compares to that of 72 threads on the host). This, combined with the increased contention and subsequent retries, explains the poorer performance at higher thread counts. In Figure 5.15 we present the speedup from running mcvitie-wilson and gale-shapley on the Xeon Phi. Note that the speedup is measured relative to the sequential gale-shapley on Lyng. The jitter observed in this plot is likely because of the random nature of the retries when compare and swap fails.

5.4 b-Marriage results


In this section we present our experimental results with the parallel b-marriage
algorithm. We compare the parallel b-marriage algorithm (Algorithm 7) to the


Figure 5.13: Best Xeon Phi (240 threads) vs Lyng (72 threads) times on easy instances (less is better)

Figure 5.14: Best Xeon Phi (240 threads) vs Lyng (72 threads) times on hard instances (less is better)


Figure 5.15: Easy instances speedup on Xeon Phi over gale-shapley on Lyng (more is better)

parallel mcvitie-wilson algorithm (Algorithm 6), to show the overhead imposed by keeping a queue of suitors. We present the speedup on the different datasets with b = 1, 3 and 5. We tested two different implementations of the priority queue used to store the suitors: one using linear search (LS) through an array and one using a heap. We show that the different priority queue implementations have a significant impact on the efficiency when we vary b.
First it is interesting to know how the b-marriage algorithm performs in relation to the mcvitie-wilson algorithm. Figure 5.16 shows the runtime of the parallel mcvitie-wilson algorithm and the parallel b-marriage algorithm with b = 1 on easy instances. It shows that b-marriage uses between 40% and 50% more time than the mcvitie-wilson algorithm. This is likely due to three factors: the locking necessary for updating the priority queue, the atomic fetch and add in the nextCandidate function, and the extra operations to update the priority queue itself. We did not test the algorithm with b = 1 above 60M as we believe a point was made: a specialised version of the algorithm for problems where b = 1 yields significant runtime gains over using the more general b-marriage algorithm.
In Figure 5.17 we present the runtime results for the b-marriage algorithm. We see that the runtime with b = 5 is significantly slower than on the same datasets

Figure 5.16: b-marriage (LS) b = 1 vs mcvitie-wilson runtime (less is better)

Figure 5.17: b-marriage (heap) best runtime (less is better)

when b = 3. This is likely due to there being 5/3 times more matches to be made.
Figure 5.18 shows the speedup on easy instances (with the heap as queue implementation) with b = 3 and b = 5. Here we see that the parallel algorithm gives speedups of up to 41x with b = 5 and up to 27.8x with b = 3. We think the increase in speedup with the increased value of b is caused by the parallel algorithm spending proportionally less time in a critical section.
In our tests the atomic operation did better than locking and unlocking with OpenMP lock structures.
Recall from Algorithm 7 that if the queue is not full it does not dequeue the worst candidate. When we increase the value of b, we increase the number of matches being made without rejecting an existing proposal, and therefore spend proportionally less time holding a lock. Depending on the implementation of the queue this operation might be expensive. The results presented in Figure 5.18 were obtained with a heap, which requires O(log(b)) operations for each enqueue/dequeue operation.
In Figure 5.19 we run the same algorithm where the heap is replaced with a linear search through an array. The linear search requires O(b) operations for both the enqueue and the enqueue/dequeue operation, but has a much smaller overhead for small values of b. The difference in time spent locked in the two execution paths is visible in the plot, making the difference in speedup between the two values of b smaller. This difference in efficiency is also evident in the efficiency plot presented

Figure 5.18: Easy instances speedup with heap variant (more is better)

Figure 5.19: Easy instances speedup with linear search variant (more is better)


Figure 5.20: Easy instances efficiency comparison (T(1)/T(p)/p, more is better)

in Figure 5.20, where the efficiency of the LS variant is less dependent on b than
the heap variant.
For the hard instances we experienced more variance in the speedup, as shown in Figure 5.21. For these results we did not have exclusive access to the machine, but they still serve as an indication of what we could expect without competing jobs.

5.5 Popular matching results

In this section we present the experimental results of running the sequential and parallel versions of the huang-kavitha algorithm. As explained in Section 3.3, the popular matching problem is primarily a problem on SMI instances (on an SM instance it is simply an instance of the stable marriage problem). Because of this we only test the algorithms on SMI instances.
In Figure 5.22 we compare the runtime of the parallel mcvitie-wilson algorithm and the parallel huang-kavitha algorithm. Recall from Section 3.3.1 that only the unmatched men of the first stable marriage solution are lifted to a higher level and retried. Because the stable marriage match percentage on these instances is quite high (see Section 5.3.1), the number of men brought up to the second level is quite low (less than 3%). This makes the huang-kavitha algorithm perform
quite close to the mcvitie-wilson algorithm, with a slight deviation at the 75M, 100M and 125M instances.

Figure 5.21: Hard instances speedup (more is better)

Figure 5.22: huang-kavitha time vs mcvitie-wilson time on easy instances (less is better)


On our hardware the single-width and double-width compare-and-swap operations have the same latency, so the increased amount of data to compare and swap should not impact the speedup. This is also what we observe in Figure 5.23, where the speedup is comparable to that of the stable marriage algorithms presented in Section 5.3.
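The update pattern can be sketched as follows (our illustration: the per-woman state, suitor plus proposal level, is packed into one trivially copyable struct and replaced with a single compare-and-swap; the thesis packs two 64-bit words and relies on a double-width CAS such as x86-64's cmpxchg16b, while this sketch uses two 32-bit fields so the atomic is lock-free with common toolchains; the names and the simplified upgrade rule are assumptions):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Per-woman state for the two-level algorithm, packed so that one CAS
// replaces both fields at once.
struct Entry {
    int32_t man;    // current best suitor, -1 if none
    int32_t level;  // proposal level of that suitor (0 or 1)
};

// Install (man, level) if it beats the current entry.  In this simplified
// rule a higher level always wins; the real algorithm would also compare
// the woman's ranks within a level (elided here).
bool try_upgrade(std::atomic<Entry>& slot, int32_t man, int32_t level) {
    Entry cur = slot.load();
    while (cur.man == -1 || level > cur.level) {
        if (slot.compare_exchange_weak(cur, Entry{man, level}))
            return true;            // both fields swapped atomically
    }
    return false;
}
```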

Figure 5.23: huang-kavitha speedup on easy instances (more is better)

Chapter 6

Conclusion

In this chapter we first give a summary of the thesis and a short conclusion based on our experimental results, and then some thoughts on further improvements.

In the first three chapters we introduced the notions of matching, parallelism and preference. We then gave implementations of three different problems for matching under preference: the stable marriage problem and the closely related b-marriage and popular matching problems. In chapter 4 we first presented some previous work on parallel algorithms for the stable marriage problem. We then introduced our parallel algorithms for the previously presented problems, based on ideas from the parallel Suitor and b-Suitor algorithms. In chapter 5 we presented runtime results for both the sequential and parallel algorithms, showing scaling on all but the smallest instances.

The experimental results show that the ideas presented in the Suitor and b-Suitor algorithms transfer well to the parallel algorithms for matching under preferences presented in this thesis.

Bibliography

[1] T. A. Davis and Y. Hu. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software, 38(1):1–25, 2011.

[2] D. Gale and L. S. Shapley. College admissions and the stability of marriage.
The American Mathematical Monthly, 69(1):9–15, 1962.

[3] David Gale and Marilda Sotomayor. Some remarks on the stable matching problem. Discrete Applied Mathematics, 11(3):223–232, 1985.

[4] Chien-Chung Huang and Telikepalli Kavitha. Popular matchings in the sta-
ble marriage problem. Information and Computation, 222:180 – 194, 2013.
38th International Colloquium on Automata, Languages and Programming
(ICALP 2011).

[5] M. Elizabeth C. Hull. A parallel view of stable marriages. Information Processing Letters, 18(2):63–66, 1984.

[6] Arif Khan, Alex Pothen, Md. Mostofa Ali Patwary, Nadathur Rajagopalan
Satish, Narayanan Sundaram, Fredrik Manne, Mahantesh Halappanavar,
and Pradeep Dubey. Efficient approximation algorithms for weighted b-matching. SIAM Journal on Scientific Computing, 38(5):S593–S619, 2016.

[7] Jesper Larsen, Claus C. Carøe, and David Pisinger. A parallel approach to
the stable marriage problem. pages 277–287, 1997.

[8] F. Manne, M. Naim, H. Lerring, and M. Halappanavar. On Stable Marriages and Greedy Matchings, pages 92–101. Society for Industrial and Applied Mathematics, 2016.

[9] Fredrik Manne and Rob H. Bisseling. A Parallel Approximation Algorithm for
the Weighted Maximum Matching Problem, pages 708–717. Springer Berlin
Heidelberg, Berlin, Heidelberg, 2008.


[10] D. G. McVitie and L. B. Wilson. The stable marriage problem. Communications of the ACM, 14(7):486–490, July 1971.

[11] H. Schweizer, M. Besta, and T. Hoefler. Evaluating the cost of atomic opera-
tions on modern architectures. In 2015 International Conference on Parallel
Architecture and Compilation (PACT), pages 445–456, Oct 2015.

[12] S. S. Tseng and R. C. T. Lee. A parallel algorithm to solve the stable marriage
problem. BIT Numerical Mathematics, 24(3):308–316, 1984.

Appendix A

Appendix

A.1 On Stable Marriages and Greedy Matchings

On Stable Marriages and Greedy Matchings
Fredrik Manne∗ , Md. Naim∗ , Håkon Lerring∗, and Mahantesh Halappanavar†

Research on stable marriage problems has a long and mathematically rigorous history, while that of exploiting greedy matchings in combinatorial scientific computing is a younger and less developed research field. In this paper we consider the relationships between these two areas. In particular we show that several problems related to computing greedy matchings can be formulated as stable marriage problems, and as a consequence several recently proposed algorithms for computing greedy matchings are in fact special cases of well known algorithms for the stable marriage problem.

However, in terms of implementations and practical scalable solutions on modern hardware, work on computing greedy matchings has made considerable progress. We show that due to the strong relationship between these two fields many of these results are also applicable for solving stable marriage problems. This is further demonstrated by designing and testing efficient multicore as well as GPU algorithms for the stable marriage problem.

1 Introduction

In 1962 Gale and Shapley formally defined the stable marriage problem and gave their classical algorithm for its solution [6]. Since then this field has grown tremendously, with numerous applications both in mathematics and in economics. For a recent overview see the book by Manlove [17]. Graph matching is a related area where the object is also to find pairs of entities satisfying various optimality criteria. These problems find a large number of applications. For an overview of some of these, motivated from combinatorial scientific computing, see [24].

While research on stable marriage problems has mainly focused on theory and mathematical rigor, work on graph matching in combinatorial scientific applications has had a larger practical component concerned with implementing and testing code on various computer architectures with the intent of developing fast scalable algorithms.

In this paper we investigate the connection between one type of matching problems, namely those of computing greedy weighted matchings, and algorithms for solving stable marriage problems. Although there exist exact algorithms for solving various weighted matching problems, these tend to have running times that typically involve the product of the number of vertices and the number of edges. As large graph instances can contain tens of millions of vertices and billions of edges, it is clear that such algorithms can easily become infeasible. For this reason there has been a strong interest in developing fast approximation algorithms and also in parallelizing these, see [19] and the references therein. Although such algorithms typically only guarantee an approximation factor of 0.5 compared to the optimal one, practical experiments have shown that they are very often only within a few percent from optimal. One such algorithm is the classical greedy algorithm applied to an edge weighted graph. Here edges are considered by decreasing weight, and an edge is included in the matching if it is not adjacent to an already included edge.

The main contributions of this paper are as follows. Initially we consider implementation issues when designing efficient algorithms for the stable marriage problem. Next, we show that several recently published algorithms for computing greedy matchings are in fact special cases of classical algorithms for stable marriage problems. This also includes a generalization of the matching problem known as b-matching. Here each vertex can be matched with several other vertices in the final solution. Due to the strong similarities between the stable marriage problem and greedy matching, we show that one can apply recent results on designing scalable greedy matching algorithms to the computation of stable marriage solutions. This is verified by presenting efficient parallel implementations of various Gale-Shapley type algorithms for both multithreaded computers as well as for GPUs.

The remainder of the paper is organized as follows. In Section 2 we review the Gale-Shapley algorithm and consider implementation issues related to this. Next, in Section 3 we show that the computation of a greedy matching can be reformulated as a stable marriage problem. This is also shown to be true for greedy b-matching. In Section 4 we give parallel implementations of the Gale-Shapley and McVitie-Wilson algorithms and show their scalability, before concluding in Section 5.

∗ Department of Informatics, University of Bergen, N-5020 Bergen, Norway. Email: {fredrikm,naim}@ii.uib.no, [email protected]
† Pacific Northwest National Laboratory, 902 Battelle Boulevard, P.O. Box 999, MSIN J4-30, Richland, WA 99352, USA. Email: [email protected]
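The classical greedy matching algorithm described in the introduction can be sketched in a few lines (our C++ illustration under an assumed edge-list representation; not code from the paper):

```cpp
#include <algorithm>
#include <cassert>
#include <tuple>
#include <vector>

using Edge = std::tuple<double,int,int>;  // (weight, u, v)

// Greedy weighted matching: consider edges by decreasing weight and keep an
// edge whenever both endpoints are still free.  The result is a maximal
// matching of at least 0.5 times the optimal weight.
std::vector<Edge> greedy_matching(std::vector<Edge> edges, int n) {
    std::sort(edges.begin(), edges.end(),
              [](const Edge& a, const Edge& b) {
                  return std::get<0>(a) > std::get<0>(b);
              });
    std::vector<bool> matched(n, false);
    std::vector<Edge> m;
    for (const Edge& e : edges) {
        int u = std::get<1>(e), v = std::get<2>(e);
        if (!matched[u] && !matched[v]) {   // both endpoints still free
            matched[u] = matched[v] = true;
            m.push_back(e);
        }
    }
    return m;
}
```

On the path 0–1–2–3 with weights 2, 3, 2 the greedy solution keeps only the middle edge (weight 3), while the optimal matching takes the two outer edges (total weight 4), illustrating the 0.5 guarantee.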
2 The Stable Marriage Problem

In the following we describe the stable marriage (SM) problem and how it can be solved using the Gale-Shapley algorithm. We also consider some implementation issues related to the algorithm. Finally, we review some generalizations of the SM problem.

The SM problem is defined as follows. Let L and R be two equal sized sets L = {l1, l2, . . . , ln} and R = {r1, r2, . . . , rn}. The entries in L are typically referred to as "men", while the entries in R are referred to as "women". Every man and woman has a total ranking of all the members of the opposite sex. These rankings give the "desirability" for each participant to match with a participant in the other set. The object is now to find a complete matching M (i.e. a pairing) between the entries in L and R such that no two li ∈ L and rj ∈ R both would obtain a higher ranked partner if they were to abandon their current partner in M and rematch with each other. Any solution satisfying this condition is stable.

Gale and Shapley [6] defined the stable marriage problem and also proposed the first algorithm for solving it. The algorithm operates in rounds as follows. In the first round each man in L proposes to his most preferred woman in R. Each woman will then reject all proposals except the one she has ranked highest. In subsequent rounds each man that was rejected in the previous round will again propose to the woman which he has ranked highest, but now disregarding any woman that he has already proposed to in previous rounds. Gale and Shapley showed that this process will terminate with each man in L being matched to a woman in R and that this solution is stable. Although an SM instance can have many stable solutions, the Gale-Shapley algorithm will always produce the same one.

An important variant of this problem is when each participant has only ranked a subset of the opposing participants. This problem is known as the stable marriage problem with incomplete lists (SMI). Any solution M to an SMI instance must then, in addition to being stable, also consist of mutually ranked pairs (li, rj). The SMI problem is solved by the Gale-Shapley algorithm, but the solution might not be complete, leaving some participants unmarried [6].

There exists a number of variants of the SM problem; for two comprehensive surveys see the books [10, 17]. In the following we will only consider the classical SM and SMI problems.

The original Gale-Shapley algorithm is described as operating in rounds, where only the men who were rejected in round t will propose in round t + 1. It is not stated in which order the proposals in a round should be made or what kind of data structures to use. If one traverses the men in L in their original order in each round and lets each rejected man propose once it is discovered, then this will ensure that the men always propose in the same relative order in each round. The running time of such a scheme is Θ(n²) even for an instance of SMI.

If one is willing to forgo the requirement that the proposals in each round must be made in the same relative order, then it is not hard to design an implementation of the Gale-Shapley algorithm with running time proportional to the number of actual proposals made. To do this one can maintain a queue Q of men waiting to make their proposals. Initially Q = L, and in each step of the algorithm the man at the front of the queue gets to propose to his current best candidate rj ∈ R, and any rejected li is inserted at the end of the queue. This will ensure that all men rejected in round t get to propose before any man rejected in round t + 1, but the relative order among the men might not always be the same. The algorithm terminates when the queue is empty.

One simple enhancement one can make to the Gale-Shapley algorithm is that no li ∈ L should propose to an rj ∈ R who already has a proposal from someone whom rj ranks higher than li, as such a proposal will be rejected. Thus each li should propose to his most preferred rj where li has not already been rejected and where rj ranks li higher than her current best proposal (if any). In terms of implementation this means that it is sufficient to only maintain the current best proposal for each rj. When the algorithm terminates these proposals will make up the solution. We give our complete implementation of the Gale-Shapley algorithm in Algorithm 1.

In Algorithm 1 each rj has a variable suitor(rj) initialized to NULL that holds her current best proposal. Similarly, ranking(rj, li) returns rj's ranking of li (as a number in the range 1 through n). We define ranking(rj, NULL) = n + 1 to ensure that any proposal is better than no proposal. The function nextCandidate(li) will initially return li's highest ranked woman and then for successive calls return the next highest ranked one following the last one retrieved.

For an SM instance it is straightforward to precompute the values of ranking() in O(n²) time. However, for an SMI instance maintaining a complete ranking() table would require O(n²) space and also proportional time to initialize it. In this case it is more efficient to store the value of ranking(ri, lj) together with ri in lj's ranking list so that it can be fetched in O(1) time when needed. These values can be precomputed in time proportional to the sum of the lengths of the ranking lists. To do this one first traverses the women's lists, building up lists for each man lj with the women that have ranked him along with in what position. Then, using an array position() of length n initially set to 0, the list of each man lj is processed as follows. For each woman ri that has ranked lj we store the value lj along with in what position ri has ranked lj in position(ri). We next traverse lj's priority list and for each ri in the list we look up position(ri) and see if it contains lj. If so, we fetch ri's
ranking of lj and store it together with lj's ranking of ri. At the same time any ri that has not ranked lj but which lj has ranked can be purged from the priority list of lj.

Algorithm 1 The Gale-Shapley algorithm using a queue
1: Q = L
2: while Q ≠ ∅ do
3:     u = Q.dequeue()
4:     partner = nextCandidate(u)
5:     while ranking(partner, u) > ranking(partner, suitor(partner)) do
6:         partner = nextCandidate(u)
7:     if suitor(partner) ≠ NULL then
8:         Q.enqueue(suitor(partner))
9:     suitor(partner) = u

McVitie and Wilson [5] gave a recursive implementation of the Gale-Shapley algorithm. This algorithm also iterates over the men, allowing each one to make a proposal to his most preferred woman. But if this proposal is rejected, or if it results in an existing suitor being rejected, then the algorithm recursively lets the just rejected man make a new proposal to his best remaining candidate. The recursion continues until a proposal is made such that no man is rejected (because the last proposed-to woman did not already have a suitor). At this point the algorithm will continue with the outer loop and process the next man. When all men have been processed the algorithm is finished. It is shown in [5] that the McVitie-Wilson algorithm produces the same solution as the Gale-Shapley algorithm. We note that, similarly to the Gale-Shapley algorithm, it is possible to speed up the algorithm by avoiding proposals that are destined to be rejected because the proposed-to woman already has a better offer.

Comparing the two algorithms, each man will consider exactly the same women before ending up with his final partner. The only difference is the order in which this is done. While the Gale-Shapley algorithm will maintain a list of men that need to be matched, the McVitie-Wilson algorithm will always maintain a solution where each man considered so far is matched before including a new man in the solution.

We note that one can implement a non-recursive version of the McVitie-Wilson algorithm simply by replacing the queue Q in Algorithm 1 by a stack and replacing the dequeue() and enqueue() operations with pop() and push() operations respectively. To see that this will result in the McVitie-Wilson algorithm it is sufficient to first note that the initial placement of L in Q is equivalent to an outer loop that processes each man once. Any rejected man will then be placed at the top of the stack and therefore be processed immediately, similarly to a recursive call in the original algorithm.

Wilson [29] showed that for any profile of women's preferences, if the men's preferences are random, then the expected sum of men's rankings of their mates as assigned by the Gale-Shapley algorithm is bounded above by n(1 + 1/2 + · · · + 1/n). Knoblauch [16] showed that this is also an approximate lower bound in the sense that the ratio of the expected sum of men's rankings of their assigned mates and (n + 1)(1 + 1/2 + · · · + 1/n) − n has limit 1 as n goes to ∞. Thus if the men's preferences are random then this sum is Θ(n ln n) for large n. However, it is not hard to design instances where this sum is Θ(n²). This is for instance true for any SM instance where the men have identical preferences.

2.1 Generalizations of SM  We next review two generalizations of the SM problem. In the stable roommates (SR) problem one is given a set of n persons, each one with a complete ranking of all the other persons. The objective is now to pair two and two persons together, i.e. make them roommates, such that there is no pair (x, y) of persons where x is either unmatched or prefers y to its current partner, while at the same time y is either unmatched or prefers x to its current partner. Just like for the SM problem, such a solution is called stable. Unlike the SM problem there might not exist a solution for an SR instance. We also note that if some persons have only ranked a subset of the other participants we get the stable roommates problem with incomplete lists (SRI). Irving gave an algorithm for computing a stable solution to an SRI problem or to determine that no such solution exists [12]. This algorithm operates in two stages, where the first one is similar to the Gale-Shapley algorithm in that each person makes, accepts and rejects proposals. The second phase of the algorithm is slightly more involved but does not change the running time of O(n²). For more information on the SR and SRI problems see [10, 17].

In the last generalization each person can be matched with more than one person in a final solution. More formally, we are looking for a stable solution to an SM, SMI, SR, or SRI instance where each person vi is matched with at most b(vi) other persons, where b(vi) ≥ 1. Being stable again means that no two persons vi and vj would both obtain a better solution if they were to match with each other, either by dropping one of their current partners or if vi has fewer than b(vi) partners or if vj has fewer than b(vj) partners.

For the SM problem this gives us the many-to-many stable assignment problem (MMSA), where each "man" and "woman" can be matched with several participants of the opposite sex. This was solved by Baïou and Balinski [2], who presented a general algorithm based on modelling this as a graph searching problem. See also [3] for how to model the SM problem as a graph problem.

Applying the last generalization to an SR instance, we get the stable fixtures problem [13] for which Irving and Scott gave an O(n²) algorithm. Similarly to Irving's
algorithm for SRI, this algorithm also consists of two stages, increase if (u, v) was added to M while removing any edges
where the first stage is a natural extension of the Gale- incident on either u or v from M . This observation is often
Shapley algorithm to handle that each person can participate stated as that the solution given by G REEDY does not contain
in multiple matches. any augmenting path containing three or fewer edges. An
augmenting path of length k is a path containing k edges
3 Matching Problems starting with an edge in M and then alternating between
We next explore the relationship between stable marriages edges not in M and edges in M , with the property that if
and weighted matchings in graphs. A matching M on a one was to replace all the edges on the path that belong to M
graph G(V, E) is a subset of the edges such that no two with those that are not in M then the weight of the solution
edges in M share a common end point. For an unweighted would increase.
graph the object is to compute a matching of maximum We next proceed to show that the solution given by
cardinality. For an edge weighted graph a typical problem is G REEDY can also be obtained by solving an associated SMI
to compute a matching M such that the sum of the weights (or SM) instance. To the best of our knowledge this result
of the edges in M is maximum over all matchings. Another has not been shown previously.
variant could be to compute the maximum weight matching Given an instance of the GM problem on an edge
over all matchings of maximum cardinality. weighted graph G(V, E). We define an SMI instance G0
We consider the G REEDY algorithm for computing a from G as follows. Let L and R be the sets of men and
matching of maximum weight in an edge weighted graph women respectively, both of size n = |V |. Any man li
where all weights are positive. This algorithm considers will include exactly those rj in its ranking where there is
edges by decreasing weight. In each step the heaviest re- an edge (vi , vj ) ∈ E. As the edges in G are not directed,
maining edge (u, v) is included in the matching before re- this also means that lj will rank ri . Similarly, any woman
moving any edge incident on either u or v. If the weights rj will include exactly those li in her ranking where there is
of the edges in G are unique or if a consistent tie break- an edge (vi , vj ) ∈ E. Both men and women order their lists
ing scheme is used then it follows that the solution given by decreasing weight of the corresponding edges in G. Thus
by G REEDY is also unique. In the following we will al- every (vi , vj ) ∈ E gives rise to four rankings in G0 . We call
ways assume that this is the case. It is well known that the two pairs (li , rj ) and (lj , ri ) for the corresponding pairs
G REEDY guarantees a solution of weight no worse than 0.5 of (vi , vj ).
times the weight of an optimal solution. We label the prob-
L EMMA 3.1. Given a graph G with SMI instance G0 as
lem of computing a greedy matching in an edge weighted
described above. Let further M be the greedy matching on
graph as the GM problem.
G. Then the pairs in G0 corresponding to the edges in M
Given an instance G of the GM problem one can con-
make up the unique solution to the SMI problem on G0 .
struct an equivalent instance of the SRI problem by sorting
the edges incident on each vertex u by decreasing weight, Proof. The proof is by induction on the edges of M con-
and letting this be the ranking of u’s neighbors in the SRI sidered by decreasing weight. Let (vi , vj ) be the edge of
instance. It is not hard to see that with this construction, a maximum weight in G. Then (vi , vj ) ∈ M and it also fol-
solution to the GM problem is equivalent to a stable solution lows from the construction of G0 that li will rank rj highest.
of the corresponding SRI problem. Consider the heaviest Similarly, li will also have the highest ranking among the
edge (u, v) in the graph. This is included in the GM solution men ranked by rj . Thus (li , rj ) must be included in any sta-
and the corresponding vertex pair must also be part of any ble solution of G0 . A similar argument shows that the edge
solution to the SRI instance, otherwise this solution would (lj , ri ) will also be included in such a solution.
not be stable as both u and v would prefer to match with Assume now that the pairs in G0 corresponding to the
each other over any other partner. We can thus include (u, v) k ≥ 1 heaviest edges in M must be included in any
in the solutions to both instances and also remove u and v stable solution and consider the two pairs (ls , rt ) and (lt , rs )
from further consideration. For the GM problem this means corresponding to the k + 1st heaviest edge (vs , vt ) in M .
that any edges incident on either u or v are removed and It is clear that any solution where ls is matched to a
for the SRI instance u and v are removed from all ranking woman that he has ranked after rt while at the same time
lists. One can then repeat the argument using the heaviest rt is matched to a man that she has ranked after ls , cannot
remaining edge, and it thus follows by induction that the two be stable as both ls and rt would be better of if they were to
solutions are identical. It is also clear that the corresponding match with each other. Thus if (ls , rt ) is not included in a
SRI instance always has a unique stable solution. stable solution at least one of ls and rt must be matched to a
The above construction implies that the solution given partner which he or she has ranked higher than the other one
by G REEDY is stable in the sense that there does not exist of {ls , rt }. Assume therefore that ls is matched to ru and
an edge (u, v) 6∈ M such that the weight of M would that ls has ranked ru higher than rt , implying that the weight
of (vs , vu ) is greater than the weight of (vs , vt ) in G. But stance. Also let {rs1 , rs2 , . . . , rsf } and {lt1 , lt2 , . . . , ltg } be
since (vs , vt ) ∈ M it follows that (vs , vu ) 6∈ M . Thus vu the ranked lists of the man li and the woman ri respectively.
must be matched to some other vertex vz in M . And since Then if follows from the construction of G0 that f = g and
(vs , vu ) 6∈ M the weight of (vu , vz ) must be greater than that that sk = tk for all k. Thus any proposal made to ri could
of (vs , vu ). By the induction hypothesis the pairs in G0 that be handled directly by li as he has the same information as
correspond to the k heaviest edges in M must be included ri . It follows that one can merge li and ri into one node vi
in any stable solution in G0 . It therefore follows that lz must that handles making, accepting, and rejecting proposals re-
be matched to ru in any stable solution on G0 contradicting lated to li and ri . In this way both the Gale-Shapley and
that ls is matched to ru . A similar argument shows that rt the McVitie-Wilson algorithm can be used directly on edge
cannot be matched to any man in L to which she gives higher weighted general graphs to compute greedy matchings, but
priority than she gives to ls . Thus the pair (ls , rt ) must be in now using edge weights to rank potential partners.
any stable solution in G0 . The argument for why (lt , rs ) also Irving’s algorithm [12] for solving the SRI problem
must be included in a stable solution is analogous. It follows consists of two stages, of which the first is exactly this
that any pair in G0 corresponding to an edge in M must be algorithm of making, accepting, and rejecting proposals on
part of a stable marriage in G0 . a general graph. If the rankings in an SRI instance are based
It only remains to show that once the pairs correspond- on edge weights from a GM instance then the first phase will
ing to the edges in M have been included in the solution M 0 produce the greedy solution which is stable, thus making the
to G0 , then it is not possible to match any other pairs in G0 . second phase of the algorithm redundant.
If M′ contains a pair (li, rj) in addition to the pairs corresponding to the edges in M then (vi, vj) ∉ M and neither vi nor vj can be matched in M. But since li has ranked rj (and vice versa) it follows that (vi, vj) ∈ E and that M can be expanded with (vi, vj). This contradicts that M is maximal and the result follows.

We next consider the b-matching problem, which is a generalization of the regular weighted matching problem similar to the many-to-many stable assignment problem and the stable fixtures problem. A b-matching on G is a subset of edges M ⊆ E such that every vertex vi ∈ V has at most b(vi) edges in M incident on it. The objective is to compute the b-matching of maximum weight. A 0.5 approximation can again be computed using the greedy algorithm that selects edges by decreasing weight and, whenever b(vi) edges incident on vi have been selected, removes the remaining edges incident on vi [20]. Setting b(vi) = 1 for all vi ∈ V gives a regular (one) matching. See [21] for a survey of algorithms for computing optimal b-matchings.

It is straightforward to see that the stable fixtures problem is also a generalization of greedy b-matching. Given an instance of the greedy b-matching problem, one can also construct an equivalent many-to-many stable assignment instance by setting the bounds b(li) and b(ri) equal to b(vi). A proof similar to that of Lemma 3.1 shows that these two problems have equivalent solutions.

3.1 Algorithmic Similarities As a consequence of the fact that the solution given by GREEDY can be obtained by either solving a properly designed instance of SMI or SRI, any algorithm that solves either of these two problems can also be used to compute a greedy weighted matching. This process can be simplified as it might be possible to run an SMI or SRI algorithm directly on the original graph. Let G be an instance of GM and G′ its corresponding SMI instance. […]

Previous efforts at designing fast parallel greedy matching algorithms have been based on the notion of dominant edges. These are edges that are heavier than any of their neighboring edges. Preis showed that an algorithm based on repeatedly including dominant edges in the matching while removing any edges incident on these will result in the same solution as GREEDY [25]. Based on this observation Manne and Bisseling developed the pointer algorithm [18], which was further enhanced by Manne and Halappanavar in the SUITOR algorithm [19]. We note that the SUITOR algorithm is identical to the McVitie-Wilson algorithm applied to a general edge weighted graph, while the pointer algorithm has strong resemblances to the Gale-Shapley algorithm as outlined in Algorithm 1.

The same type of relationship also holds true between the greedy b-matching problem and the many-to-many stable assignment problem. The algorithm presented in [2] can be instantiated to solve the b-matching problem using a Gale-Shapley type algorithm where a vertex v will accept the b(v) best offers at any given time. We note that this is the same algorithm as the one presented by Georgiadis and Papatriantafilou [8] and also by Khan et al. [14] for computing a greedy b-matching. In [14] the authors experiment with what they call delayed versus eager rematching of rejected suitors. The difference between these two variants is the same as that between a Gale-Shapley and a McVitie-Wilson style algorithm.

4 Experiments

As shown in Section 3, much of the theory that has been developed lately for greedy matching algorithms consists mainly of restricted versions of previous results from the theory of stable marriages. However, the work on greedy matchings has to a large extent been driven by a need for developing scalable parallel algorithms for use in scientific applications.
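As a concrete point of reference, the greedy b-matching rule described in Section 3 (scan the edges by decreasing weight and keep an edge as long as both endpoints still have free capacity b(v)) can be sketched as follows. This is a minimal sequential Python sketch for illustration only, not the code used in the experiments, and the function name is ours; setting b(v) = 1 for every vertex yields the ordinary GREEDY matching.

```python
def greedy_b_matching(edges, b):
    """edges: list of (weight, u, v) tuples; b: dict mapping vertex to capacity b(v)."""
    remaining = dict(b)   # capacity left at each vertex
    matching = []
    # Scan edges by decreasing weight (distinct weights assumed for a
    # deterministic result) and keep an edge while both of its
    # endpoints still have free capacity.
    for w, u, v in sorted(edges, reverse=True):
        if remaining[u] > 0 and remaining[v] > 0:
            matching.append((u, v))
            remaining[u] -= 1
            remaining[v] -= 1
    return matching
```

With b(v) = 1 this computes the same solution that the dominant-edge algorithms discussed above obtain without a global sort, by repeatedly matching edges that are heavier than all of their neighbors.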
This has led to the implementation of Gale-Shapley and McVitie-Wilson type matching algorithms on a large variety of architectures, including distributed memory machines [4, 18], multicore computers [11, 15, 19], and GPUs [1, 23]. There has been less emphasis on implementing and developing working code for the stable marriage problem. We believe that much of the work done on greedy matchings can easily carry over to developing efficient code for stable marriage problems. To show the feasibility of this we have developed shared memory implementations of both the Gale-Shapley and McVitie-Wilson algorithms. We used OpenMP to parallelize the Gale-Shapley algorithm, and both OpenMP and CUDA for parallelizing the McVitie-Wilson algorithm.

In weighted matching both endpoints of an edge (u, v) evaluate the importance of the edge to the same number, i.e., the weight of the edge, whereas in the stable marriage problem u and v each assign their own ranking of the other. Thus the main difference between greedy matching algorithms such as those presented in [18, 19] and the Gale-Shapley algorithm is that in the latter, a man who makes a proposal evaluates his chance of success based on the woman's ranking, instead of on a common value. Another difference is that it is typically not assumed in weighted matching problems that the neighbor list of a vertex is sorted by decreasing weight. It was shown in [19] that when this is the case, it both simplifies the algorithm and speeds up the execution considerably.

Our parallelization strategy for the McVitie-Wilson algorithm using OpenMP closely follows that of the SUITOR algorithm as presented in [19], while our CUDA version of the same algorithm is a simplified version of the SUITOR algorithm used in [23]. In both of our OpenMP algorithms the set of men is initially partitioned among the threads, each of which then runs a local version of the corresponding algorithm until completion. A thread will first search the list of the current man to locate the highest-priority woman on his list who also prefers him to her current suitor (if any). If such a woman is discovered the thread will use a compare-and-swap (CAS) operation to become the new suitor of the woman. In this way it is assured that no other thread has changed the suitor value. If the CAS operation succeeds, the previous suitor (if any) is treated according to the current strategy and is inserted in a local stack (McVitie-Wilson) or a local queue (Gale-Shapley). If the CAS operation failed because some other thread had already changed the suitor value, then if the current man can still beat the new suitor the thread will retry with a new CAS operation; otherwise it will continue searching for the next eligible woman.

There is a difference between the algorithms in how they can handle load imbalance. For the parallel Gale-Shapley algorithm it is possible to synchronize the threads after each round of proposals and then redistribute the unmarried men to the threads before moving on to the next round. However, synchronization tends to be costly, and experiments done on greedy matching problems indicate that this is typically not worth the effort. For the McVitie-Wilson algorithm one can load balance the algorithm by using one of the dynamic load balancing strategies in OpenMP, such as dynamic or guided, in the initial assignment of men to threads. This strategy was used successfully in experiments with the SUITOR matching algorithm on graphs with highly varying vertex degrees [19].

For the CUDA algorithm we assign one thread to each man in our implementation of the McVitie-Wilson algorithm. Each thread then executes the algorithm similarly to the OpenMP version, using a CAS operation to assign a man as the suitor of a particular woman. Using only one thread per man allows for a larger number of thread blocks, which the runtime environment can balance across the device. But as the threads within one physical warp operate in SIMD, the running time of all threads in the same warp will be equal to the maximum execution time of any of the threads. Similarly, the threads within the same thread block will not release resources until all threads in the block have finished executing. It would have been possible to statically assign multiple men to each thread or to design a dynamic load balancing scheme with the aim of evening out the work load. But this would have resulted in a more complicated algorithm, and as our main goal is proof of concept we did not pursue this.

Implementing the Gale-Shapley algorithm on the GPU presents additional challenges compared to the McVitie-Wilson algorithm. In a Gale-Shapley algorithm the threads would have to be grouped so that each thread group operates on one common queue, where the size of the group could be either a subset of the threads in a warp or all the threads in one thread block. As the number of free men monotonically decreases between rounds, there should initially be more men than threads assigned to the same queue, something that would complicate the algorithm. Also, having several threads operate on one common queue would require synchronization, which can be time consuming on the GPU. For this reason we chose not to implement the Gale-Shapley algorithm using CUDA.

As we are not aware of any sufficiently large publicly available data sets for the stable marriage problem, we have designed two different random data sets. The first set has been constructed to be relatively easy to solve, whereas the second set is intended to be more time consuming. We label these sets as easy and hard respectively. Each instance consists of n men and n women. In the easy data set each man is assigned a random number ε ∈ [0, 1] and then randomly picks and ranks (1 + ε) ln n women. Each woman then ranks exactly the men that rank her. With this configuration more than 98% of the participants were matched in every final solution and the total number of
proposals is at most 2n ln n, with an average of n ln n. In the hard data set each man has an identical complete random ranking of all the women. Similarly, all women share the same random ranking of all the men. Thus there will always exist a complete stable solution, and the total number of considered women will always be n(n + 1)/2. Moreover, in the hard instances there will be contention among the men for obtaining the same set of women, which causes substantial synchronization requirements for the parallel algorithms. One obvious difference between the datasets is that the easy instances will require more memory access, as each participant has an individual ranking list, while for the hard instances all rankings are stored in two shared vectors of length n.

For each value of n we have generated 5 instances, and for each of these we run each algorithm 3 times. For all timings we take the average of these 15 runs.

The OpenMP algorithms are run on a computer with two Intel Xeon E5-2699 processors and 252 Gbytes of memory. Each processor has 18 cores and runs at 2.30 GHz. As the machine can use hyperthreading it is possible to use up to 72 threads. However, going beyond 36 threads did not give any additional speedup, and thus we restrict our reporting to a maximum of 36 threads. The GPU is a Tesla K40m with 12 GB of memory and 2880 cores running at 745 MHz, and it has CUDA compute capability 3.5.

In Figure 1 we present results from the easy instances when n varies from 5M up to 25M in steps of 5M. For the OpenMP algorithms the number of threads is set to 36. For most of these instances the running time stays well below one second. It is only for the n = 25M instance that the GPU algorithm uses slightly more than one second. This is also the largest easy instance that could be run on the GPU.

Figure 1: Running time on the easy dataset

Figure 2: Speedup on the easy dataset

For smaller instances the GPU algorithm is the fastest one, but as the problem size increases it slows down compared to the OpenMP algorithms. In general the OpenMP Gale-Shapley algorithm is faster than the OpenMP McVitie-Wilson algorithm by as much as 12%. For this setup one would expect the graph displaying the time to resemble n ln n, as the computing resources are the same for each instance. This is closest to true for the OpenMP algorithms, where the time increases close to linearly with n, whereas the time grows faster than n for the GPU algorithm. This can be seen further in Figure 2, which shows the speedup of both McVitie-Wilson algorithms compared to their sequential counterpart. The OpenMP algorithm gives a constant speedup of about 9, while the speedup of the GPU algorithm starts out at about 18 but then drops sharply as the size of the instances increases. This is most likely due to insufficient memory on the GPU.

Figure 3 shows running times of the OpenMP algorithms using 36 threads as n increases up to 125M. It can be observed that the tendencies for the smaller instances still remain true for the larger ones. We note that the worst running time is only marginally larger than four seconds on the largest instance. Figure 4 shows the speedup of the OpenMP algorithms compared to the sequential Gale-Shapley algorithm for the three largest instances when using t = 1, 9, 18, 27 and 36 threads. The Gale-Shapley algorithm outperforms the McVitie-Wilson algorithm in almost all instances and reaches a speedup of approximately 15 on the n = 75M instance. We note that the relative performance of the two algorithms could have been different if there were a more uneven load balance between the threads.

Figure 5 shows the running time on hard instances where n increases from 100K up to 500K. These problems measure how well the algorithms and hardware can handle contention for shared resources. As the dataset only consists of two vectors we can run the problems using all three codes, the only limiting factor being time. Since the total amount of work grows as O(n²) on these instances it is to be expected that they will require more time than the easy ones. From
Figure 3: Running time of OpenMP algorithms on large easy instances

Figure 4: Speedup on large easy datasets

Figure 5: Running time on hard datasets

Figure 6: Speedup on hard datasets
the figure it can be observed that there is little difference in the running time between the Gale-Shapley OpenMP code and the McVitie-Wilson GPU code, which both take close to 250 seconds on the largest instance. However, the McVitie-Wilson OpenMP code performs considerably better, and is a factor of 5 faster on the largest instance. This difference is also displayed in Figure 6, which gives the speedup of the same instances.

We believe that the difference between the OpenMP algorithms can be explained by how the algorithms handle the large number of rejections. While the Gale-Shapley algorithm has to store each rejected man to memory and retrieve a new one, the McVitie-Wilson algorithm can continue working on the rejected man without needing to access relatively slow memory. The poor performance of the McVitie-Wilson GPU algorithm compared to the OpenMP one is most likely due to how the machines handle contention for shared resources. The GPU algorithm utilizes several thousand concurrent threads that, at least initially, will be competing to match their man with the same set of women. Synchronizing this will lead to a much larger strain on the system compared to the OpenMP algorithm, where there are never more than 36 concurrent threads.

Finally, in Figures 7 and 8 we show the number of considered proposals per second for both easy and hard datasets on the OpenMP algorithms. For each instance this number is given as the sum over each man of his ranking of his final partner, divided by the total time. In sparse graph algorithms this is often referred to as the number of traversed edges per second (TEPS) and is, among other things, used to rank the performance of computers in the Graph500 challenge [9, 22].

For the easy instances the TEPS rate starts out at 10M for one thread and then increases to somewhere between 100M and 140M for 36 threads. Thus the efficiency when using 36 threads lies somewhere in the range of 30% to 40%. For the hard instances the TEPS rate for the McVitie-Wilson algorithm starts out at about 150M and increases up to 2.75 billion when using 36 threads, for an efficiency rate of about 50%. As already noted, the Gale-Shapley algorithm does not scale well on these instances. Comparing the TEPS rate between the easy and the hard instances when using the McVitie-Wilson algorithm on the same number of threads, it can be observed that the maximum TEPS rate is more than a factor of 20 larger for the hard instances. This is most likely because the hard instances are not limited by access to memory, as the whole dataset only consists of two vectors.

Figure 7: TEPS for easy datasets

Figure 8: TEPS for hard datasets

5 Conclusion

In his book, Manlove [17] lists some of the most noteworthy open problems related to SM. One of these is to determine whether the SM problem is in the complexity class NC or not, that is, to determine whether the problem can be solved by an algorithm with polylogarithmic running time when using a polynomial number of processes. Efforts at designing such algorithms have mainly resulted in parallel algorithms requiring at least n² processes, and are thus mainly of theoretical interest [7, 28].

We are only aware of one previous attempt at implementing a parallel version of the Gale-Shapley algorithm, and this did not result in any speedup [27]. Quinn [26] argues that one cannot expect a large speedup from a parallel Gale-Shapley style algorithm in practice, as the algorithm cannot run faster than the maximal number of proposals made by any one man. We note that for a random instance the average number of proposals made by each man is in fact O(log n).

While the question of developing asymptotically faster parallel algorithms than those presented in Section 4 is of interest from a theoretical point of view, we believe that this is less relevant for a practitioner. To begin with, the running time of the Gale-Shapley algorithm is linear in the instance size. Thus moderate-sized problems can already be solved rapidly. In addition, our current experiments on the SMI problem, as well as previous experiments on GM problems, show that Gale-Shapley type algorithms scale well. One reason for this is that the size of the instance n is typically much larger than the number of threads or processes used.

One notable difference between the formulations of the GM and the SMI problem is that for GM it is not assumed that the neighbor lists are initially sorted by decreasing weight in the same way as priority lists are ordered in SM. Thus work on developing parallel algorithms for the GM problem has focused on how one should search the neighbor lists. Suggested solutions include sorting the lists initially, searching through the list each time a new candidate is needed, or something in between. All of these strategies result in a running time that is superlinear in the input size. However, Preis's algorithm for GM has linear running time [25], but it is more complicated and not suitable for parallel execution. We therefore ask if it is possible to design a linear time algorithm for the SM problem if the priority lists are not sorted, but instead given as real-valued numbers such that p_i(j) gives the value that person i assigns to person j of the opposite sex.

References

[1] B. O. Fagginger Auer and R. H. Bisseling. A GPU algorithm for greedy graph matching. In Facing the Multicore-Challenge II, volume 7174, pages 108–119. LNCS, Springer, 2012.
[2] M. Baïou and M. Balinski. Many-to-many matching: stable polyandrous polygamy (or polygamous polyandry). Discrete Applied Mathematics, 101(1-3):1–12, 2000.
[3] M. Balinski and G. Ratier. Of stable marriages and graphs, and strategy and polytopes. SIAM Review, 39(4):575–604, 1997.
[4] Ü. V. Çatalyürek, F. Dobrian, A. H. Gebremedhin, M. Halappanavar, and A. Pothen. Distributed-memory parallel algorithms for matching and coloring. In IPDPS Workshops, pages 1971–1980, 2011.
[5] D. G. McVitie and L. B. Wilson. The stable marriage problem. Communications of the ACM, 14(7):486–490, 1971.
[6] D. Gale and L. S. Shapley. College admissions and the stability of marriage. The American Mathematical Monthly, 69(1):9–15, 1962.
[7] T. Feder, N. Megiddo, and S. A. Plotkin. A sublinear parallel algorithm for stable matching. Theor. Comput. Sci., 233(1-2):297–308, 2000.
[8] G. Georgiadis and M. Papatriantafilou. Overlays with preferences: Distributed, adaptive approximation algorithms for matching with preference lists. Algorithms, 6(4):824–856, 2013.
[9] Graph 500. http://www.graph500.org.
[10] D. Gusfield and R. W. Irving. The Stable Marriage Problem. The MIT Press, 1989.
[11] M. Halappanavar, J. Feo, O. Villa, A. Tumeo, and A. Pothen. Approximate weighted matching on emerging manycore and multithreaded architectures. Int. J. High Perf. Comput. App., 26(4):413–430, 2012.
[12] R. W. Irving. An efficient algorithm for the "stable roommates" problem. J. Algorithms, 6(4):577–595, 1985.
[13] R. W. Irving and S. Scott. The stable fixtures problem - a many-to-many extension of stable roommates. Discrete Applied Mathematics, 155(16):2118–2129, 2007.
[14] A. Khan, A. Pothen, M. M. A. Patwary, N. R. Satish, N. Sundaram, F. Manne, M. Halappanavar, and P. Dubey. Efficient approximation algorithms for weighted b-matching. SIAM J. Sci. Comput.
[15] A. M. Khan, D. F. Gleich, A. Pothen, and M. Halappanavar. A multithreaded algorithm for network alignment via approximate matching. In SC, page 64, 2012.
[16] V. Knoblauch. Marriage matching: A conjecture of Donald Knuth. Economics Working Papers, http://digitalcommons.uconn.edu/econ wpapers/200715, 2007.
[17] D. Manlove. Algorithmics of Matching Under Preferences. World Scientific, 2013.
[18] F. Manne and R. H. Bisseling. A parallel approximation algorithm for the weighted maximum matching problem. In PPAM'08, volume 4967 of LNCS, pages 708–717. Springer, 2008.
[19] F. Manne and M. Halappanavar. New effective multithreaded matching algorithms. In 2014 IEEE 28th International Parallel and Distributed Processing Symposium, Phoenix, AZ, USA, May 19-23, 2014, pages 519–528, 2014.
[20] J. Mestre. Greedy in approximation algorithms. In Algorithms - ESA 2006, 14th Annual European Symposium, Zurich, Switzerland, September 11-13, 2006, Proceedings, volume 4168 of Lecture Notes in Computer Science, pages 528–539. Springer, 2006.
[21] M. Müller-Hannemann and A. Schwartz. Implementing weighted b-matching algorithms: Insights from a computational study. J. Exp. Algorithmics, 5, December 2000.
[22] R. C. Murphy, K. B. Wheeler, B. W. Barrett, and J. A. Ang. Introducing the Graph 500. Cray User's Group (CUG), 2010.
[23] Md. Naim, F. Manne, M. Halappanavar, A. Tumeo, and J. Langguth. Optimizing approximate weighted matching on Nvidia Kepler K40. In 22nd IEEE International Conference on High Performance Computing, HiPC 2015, Bengaluru, India, December 16-19, 2015, pages 105–114, 2015.
[24] U. Naumann and O. Schenk. Combinatorial Scientific Computing. CRC Press, 2012.
[25] R. Preis. Linear time 1/2-approximation algorithm for maximum weighted matching in general graphs. In STACS'99, volume 1563, pages 259–269. LNCS, Springer, 1999.
[26] M. J. Quinn. Designing Efficient Algorithms for Parallel Computers. McGraw-Hill, 1987.
[27] J. L. Träff. A parallel approach to the stable marriage problem. In Proceedings of the Nordic Operations Research Conference (NOAS '97), pages 277–287, 1997.
[28] C. White and E. Lu. An improved parallel iterative algorithm for stable matching. Extended abstract, companion of IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SuperComputing 2013).
[29] L. B. Wilson. An analysis of the marriage matching assignment algorithm. BIT, 12:569–575, 1972.