
CS 473: Algorithms†

Sariel Har-Peled

March 8, 2019

† This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 License. To view a copy of this
license, visit http://creativecommons.org/licenses/by-nc/3.0/ or send a letter to Creative Commons, 171 Second
Street, Suite 300, San Francisco, California, 94105, USA.
Preface

This manuscript is a collection of class notes for the (no longer required graduate) course “473G/573/473
Graduate Algorithms” taught at the University of Illinois, Urbana-Champaign, in 1. Spring 2006, 2. Fall 07,
3. Fall 09, 4. Fall 10, 5. Fall 13, 6. Fall 14, and 7. Fall 18.
Class notes for an algorithms class are as common as mushrooms after rain. I have no plans to publish
these class notes in any form except on the web. In particular, Jeff Erickson has class notes for 374/473 which
are better written and cover some of the topics in this manuscript (but naturally, I prefer my exposition over
his).
My reasons for writing the class notes are to (i) avoid the use of a (prohibitively expensive) book in this class,
(ii) cover some topics in a way that deviates from the standard exposition, and (iii) have a clear description of
the material covered. In particular, as far as I know, no book covers all the topics discussed here. Also, this
manuscript is available (on the web) in a more convenient lecture-notes form, where every lecture has its own
chapter.
Most of the topics covered are core topics that I believe every graduate student in computer science should
know about. These include NP-Completeness, dynamic programming, approximation algorithms, randomized
algorithms, and linear programming. Other topics, on the other hand, are more optional and nice to know
about; these include network flow, minimum-cost network flow, and union-find. Nevertheless, I
strongly believe that knowing all these topics is useful for carrying out any worthwhile research in any subfield
of computer science.
Teaching such a class always involves choosing what not to cover. Some other topics that might be worthy
of presentation include advanced data-structures, computational geometry, etc. – the list goes on. Since this
course is for general consumption, more theoretical topics were left out (e.g., expanders, derandomization, etc.).
In particular, these class notes cover way more than can be covered in one semester. For my own sanity, I
try to cover some new material every semester I teach this class. Furthermore, some of the notes contain more
detail than I cover in class.
In any case, these class notes should be taken for what they are: a short (and sometimes dense) tour of some
key topics in algorithms. The interested reader should seek other sources to pursue them further.
If you find any typos, mistakes, errors, or lies, please email me.

Acknowledgments
(No preface is complete without them.) I would like to thank the students in the class for their input, which
helped in discovering numerous typos and errors in the manuscript. Furthermore, the content was greatly
affected by numerous insightful discussions with Chandra Chekuri, Jeff Erickson, and Edgar Ramos.
In addition, I would like to thank Qizheng He for pointing out many typos in the notes (which were fixed
later rather than sooner).

Copyright
This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 License. To view a copy of
this license, visit http://creativecommons.org/licenses/by-nc/3.0/ or send a letter to Creative Commons,
171 Second Street, Suite 300, San Francisco, California, 94105, USA.

— Sariel Har-Peled
March 2019, Urbana, IL

Contents

Preface 1

Contents 2

I NP Completeness 14

1 NP Completeness I 14
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2 Complexity classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.2.1 Reductions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.3 More NP-Complete problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.3.1 3SAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2 NP Completeness II 20
2.1 Max-Clique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Independent Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Vertex Cover . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4 Graph Coloring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3 NP Completeness III 25
3.1 Hamiltonian Cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Traveling Salesman Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Subset Sum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 3-dimensional Matching (3DM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.5 Partition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.6 Some other problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

II Dynamic programming 30

4 Dynamic programming 30
4.1 Basic Idea - Partition Number . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.1.1 A Short sermon on memoization . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2 Example – Fibonacci numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.1 Why, where, and when? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.2 Computing Fibonacci numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3 Edit Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3.1 Shortest path in a DAG and dynamic programming . . . . . . . . . . . . . . . . 36

5 Dynamic programming II - The Recursion Strikes Back 37

5.1 Optimal search trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.2 Optimal Triangulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.3 Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.4 Longest Ascending Subsequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.5 Pattern Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.6 Slightly faster TSP algorithm via dynamic programming . . . . . . . . . . . . . . . . 41

III Approximation algorithms 43

6 Approximation algorithms 43
6.1 Greedy algorithms and approximation algorithms . . . . . . . . . . . . . . . . . . . . 43
6.1.1 Alternative algorithm – two for the price of one . . . . . . . . . . . . . . . . . . 45
6.2 Fixed parameter tractability, approximation, and fast exponential time algorithms
(to say nothing of the dog) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.2.1 A silly brute force algorithm for vertex cover . . . . . . . . . . . . . . . . . . . . 45
6.2.2 A fixed parameter tractable algorithm . . . . . . . . . . . . . . . . . . . . . . . 45
6.2.2.1 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.3 Approximating maximum matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.4 Graph diameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.5 Traveling Salesman Person . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.5.1 TSP with the triangle inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.5.1.1 A 2-approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.5.1.2 A 3/2-approximation to TSP△-Min . . . . . . . . . . . . . . . . . . . . . . 50
6.6 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

7 Approximation algorithms II 51
7.1 Max Exact 3SAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
7.2 Approximation Algorithms for Set Cover . . . . . . . . . . . . . . . . . . . . . . . . . 53
7.2.1 Guarding an Art Gallery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
7.2.2 Set Cover . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
7.2.3 Lower bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.2.4 Just for fun – weighted set cover . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.2.4.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.3 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

8 Approximation algorithms III 57


8.1 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
8.1.1 The approximation algorithm for k-center clustering . . . . . . . . . . . . . . . . 58
8.2 Subset Sum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
8.2.1 On the complexity of ε-approximation algorithms . . . . . . . . . . . . . . . . . 60
8.2.2 Approximating subset-sum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
8.2.2.1 Bounding the running time of ApproxSubsetSum . . . . . . . . . . . . . . 62
8.2.2.2 The result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
8.3 Approximate Bin Packing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
8.4 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

IV Randomized algorithms 65

9 Randomized Algorithms 65

9.1 Some Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
9.2 Sorting Nuts and Bolts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
9.2.1 Running time analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
9.2.1.1 Alternative incorrect solution . . . . . . . . . . . . . . . . . . . . . . . . . 68
9.2.2 What are randomized algorithms? . . . . . . . . . . . . . . . . . . . . . . . . . . 68
9.3 Analyzing QuickSort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
9.4 QuickSelect – median selection in linear time . . . . . . . . . . . . . . . . . . . . . . . 69

10 Randomized Algorithms II 71
10.1 QuickSort and Treaps with High Probability . . . . . . . . . . . . . . . . . . . . . . . 71
10.1.1 Proving that an element participates in a small number of rounds . . . . . . . . 71
10.1.2 An alternative proof of the high probability of QuickSort . . . . . . . . . . . . . 73
10.2 Chernoff inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
10.2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
10.2.2 Chernoff inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
10.2.2.1 The Chernoff Bound — General Case . . . . . . . . . . . . . . . . . . . . . 75
10.3 Treaps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
10.3.1 Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
10.3.2 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
10.3.2.1 Insertion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
10.3.2.2 Deletion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
10.3.2.3 Split . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
10.3.2.4 Meld . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
10.3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
10.4 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

11 Hashing 78
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
11.2 Universal Hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
11.2.1 How to build a 2-universal family . . . . . . . . . . . . . . . . . . . . . . . . . . 81
11.2.1.1 On working modulo a prime . . . . . . . . . . . . . . . . . . . . . . . . . . 81
11.2.1.2 Constructing a family of 2-universal hash functions . . . . . . . . . . . . . 82
11.2.1.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
11.2.1.4 Explanation via pictures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
11.3 Perfect hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
11.3.1 Some easy calculations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
11.3.2 Construction of perfect hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
11.3.2.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
11.4 Bloom filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
11.5 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

12 Min Cut 87
12.1 Branching processes – Galton-Watson Process . . . . . . . . . . . . . . . . . . . . . . 87
12.1.1 The problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
12.1.2 On coloring trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
12.2 Min Cut . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
12.2.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
12.2.2 Some Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
12.3 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
12.3.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

12.3.1.1 The probability of success . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
12.3.1.2 Running time analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
12.4 A faster algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
12.5 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

V Network flow 95

13 Network Flow 95
13.1 Network Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
13.2 Some properties of flows and residual networks . . . . . . . . . . . . . . . . . . . . . . 96
13.3 The Ford-Fulkerson method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
13.4 On maximum flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

14 Network Flow II - The Vengeance 100


14.1 Accountability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
14.2 The Ford-Fulkerson Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
14.3 The Edmonds-Karp algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
14.4 Applications and extensions for Network Flow . . . . . . . . . . . . . . . . . . . . . . 103
14.4.1 Maximum Bipartite Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
14.4.2 Extension: Multiple Sources and Sinks . . . . . . . . . . . . . . . . . . . . . . . 104

15 Network Flow III - Applications 105


15.1 Edge disjoint paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
15.1.1 Edge-disjoint paths in a directed graph . . . . . . . . . . . . . . . . . . . . . . 105
15.1.2 Edge-disjoint paths in undirected graphs . . . . . . . . . . . . . . . . . . . . . . 106
15.2 Circulations with demands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
15.2.1 Circulations with demands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
15.2.1.1 The algorithm for computing a circulation . . . . . . . . . . . . . . . . . . 107
15.3 Circulations with demands and lower bounds . . . . . . . . . . . . . . . . . . . . . . . 107
15.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
15.4.1 Survey design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

16 Network Flow IV - Applications II 109


16.1 Airline Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
16.1.1 Modeling the problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
16.1.2 Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
16.2 Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
16.3 Project Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
16.3.1 The reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
16.4 Baseball elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
16.4.1 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
16.4.2 Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
16.4.3 A compact proof of a team being eliminated . . . . . . . . . . . . . . . . . . . . 116

17 Network Flow V - Min-cost flow 117


17.1 Minimum Average Cost Cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
17.2 Potentials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
17.3 Minimum cost flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
17.4 Strongly Polynomial Time Algorithm for Min-Cost Flow . . . . . . . . . . . . . . . . 122
17.5 Analysis of the Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

17.5.1 Reduced cost induced by a circulation . . . . . . . . . . . . . . . . . . . . . . . . 123
17.5.2 Bounding the number of iterations . . . . . . . . . . . . . . . . . . . . . . . . . 123
17.6 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

18 Network Flow VI - Min-Cost Flow Applications 125


18.1 Efficient Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
18.2 Efficient Flow with Lower Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
18.3 Shortest Edge-Disjoint Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
18.4 Covering by Cycles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
18.5 Minimum weight bipartite matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
18.6 The transportation problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

VI Linear Programming 129

19 Linear Programming in Low Dimensions 129


19.1 Some geometry first . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
19.2 Linear programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
19.2.1 A solution and how to verify it . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
19.3 Low-dimensional linear programming . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
19.3.1 An algorithm for a restricted case . . . . . . . . . . . . . . . . . . . . . . . . . . 132
19.3.1.1 Running time analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
19.3.2 The algorithm for the general case . . . . . . . . . . . . . . . . . . . . . . . . . . 134

20 Linear Programming 135


20.1 Introduction and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
20.1.1 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
20.1.2 Network flow via linear programming . . . . . . . . . . . . . . . . . . . . . . . . 136
20.2 The Simplex Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
20.2.1 Linear program where all the variables are positive . . . . . . . . . . . . . . . . 137
20.2.2 Standard form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
20.2.3 Slack Form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
20.2.4 The Simplex algorithm by example . . . . . . . . . . . . . . . . . . . . . . . . . 138
20.2.4.1 Starting somewhere . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

21 Linear Programming II 141


21.1 The Simplex Algorithm in Detail . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
21.2 The SimplexInner Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
21.2.1 Degeneracies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
21.2.2 Correctness of linear programming . . . . . . . . . . . . . . . . . . . . . . . . . 144
21.2.3 On the ellipsoid method and interior point methods . . . . . . . . . . . . . . . . 144
21.3 Duality and Linear Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
21.3.1 Duality by Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
21.3.2 The Dual Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
21.3.3 The Weak Duality Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
21.4 The strong duality theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
21.5 Some duality examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
21.5.1 Shortest path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
21.5.2 Set Cover and Packing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
21.5.3 Network flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
21.6 Solving LPs without ever getting into a loop - symbolic perturbations . . . . . . . . . 151

21.6.1 The problem and the basic idea . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
21.6.2 Pivoting as a Gauss elimination step . . . . . . . . . . . . . . . . . . . . . . . . 152
21.6.2.1 Back to the perturbation scheme . . . . . . . . . . . . . . . . . . . . . . . . 152
21.6.2.2 The overall algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153

22 Approximation Algorithms using Linear Programming 153


22.1 Weighted vertex cover . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
22.2 Revisiting Set Cover . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
22.3 Minimizing congestion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156

VII Miscellaneous topics 158

23 Fast Fourier Transform 158


23.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
23.2 Computing a polynomial quickly on n values . . . . . . . . . . . . . . . . . . . . . . . 159
23.2.1 Generating Collapsible Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
23.3 Recovering the polynomial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
23.4 The Convolution Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
23.4.1 Complex numbers – a quick reminder . . . . . . . . . . . . . . . . . . . . . . . . 164

24 Sorting Networks 164


24.1 Model of Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
24.2 Sorting with a circuit – a naive solution . . . . . . . . . . . . . . . . . . . . . . . . . 165
24.2.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
24.2.2 Sorting network based on insertion sort . . . . . . . . . . . . . . . . . . . . . . . 166
24.3 The Zero-One Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
24.4 A bitonic sorting network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
24.4.1 Merging sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
24.5 Sorting Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
24.6 Faster sorting networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170

25 Union Find 171


25.1 Union-Find . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
25.1.1 Requirements from the data-structure . . . . . . . . . . . . . . . . . . . . . . . . 171
25.1.2 Amortized analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
25.1.3 The data-structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
25.2 Analyzing the Union-Find Data-Structure . . . . . . . . . . . . . . . . . . . . . . . . 173

26 Approximate Max Cut 176


26.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
26.1.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
26.2 Semi-definite programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
26.3 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179

27 The Perceptron Algorithm 180


27.1 The perceptron algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
27.2 Learning A Circle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
27.3 A Little Bit On VC Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
27.3.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184

VIII Compression, entropy, and randomness 186

28 Huffman Coding 186


28.1 Huffman coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
28.1.1 The algorithm to build Huffman’s code . . . . . . . . . . . . . . . . . . . . . . . 188
28.1.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
28.1.3 What do we get . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
28.1.4 A formula for the average size of a code word . . . . . . . . . . . . . . . . . . . 190
28.2 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190

29 Entropy, Randomness, and Information 191


29.1 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
29.1.1 Extracting randomness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193

30 Even more on Entropy, Randomness, and Information 194


30.1 Extracting randomness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
30.1.1 Enumerating binary strings with j ones . . . . . . . . . . . . . . . . . . . . . . . 195
30.1.2 Extracting randomness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
30.2 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196

31 Shannon’s theorem 196


31.1 Coding: Shannon’s Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
31.1.0.1 Intuition behind Shannon’s theorem . . . . . . . . . . . . . . . . . . . . . . 197
31.1.0.2 What is wrong with the above? . . . . . . . . . . . . . . . . . . . . . . . . 198
31.2 Proof of Shannon’s theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
31.2.1 How to encode and decode efficiently . . . . . . . . . . . . . . . . . . . . . . . . 198
31.2.1.1 The scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
31.2.1.2 The proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
31.2.2 Lower bound on the message size . . . . . . . . . . . . . . . . . . . . . . . . . . 202
31.3 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202

IX Miscellaneous topics II 203

32 Matchings 203
32.1 Definitions and basic properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
32.1.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
32.1.2 Matchings and alternating paths . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
32.2 Unweighted matching in bipartite graph . . . . . . . . . . . . . . . . . . . . . . . . . 205
32.2.1 The slow algorithm; algSlowMatch . . . . . . . . . . . . . . . . . . . . . . . . . 205
32.2.2 The Hopcroft-Karp algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
32.2.2.1 Some more structural observations . . . . . . . . . . . . . . . . . . . . . . . 206
32.2.2.2 Improved algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
32.2.2.3 Extracting many augmenting paths: algExtManyPaths . . . . . . . . . . . 207
32.2.2.4 The result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
32.3 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210

33 Matchings II 210
33.1 Maximum weight matchings in a bipartite graph . . . . . . . . . . . . . . . . . . . . . 211
33.1.1 On the structure of the problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
33.1.2 Maximum Weight Matchings in a bipartite Graph . . . . . . . . . . . . . . . . . 212

33.1.2.1 Building the residual graph . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
33.1.2.2 The algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
33.1.3 Faster Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
33.2 The Bellman-Ford algorithm - a quick reminder . . . . . . . . . . . . . . . . . . . . . 213
33.3 Maximum size matching in a non-bipartite graph . . . . . . . . . . . . . . . . . . . . 213
33.3.1 Finding an augmenting path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
33.3.2 The algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
33.3.2.1 Running time analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
33.4 Maximum weight matching in a non-bipartite graph . . . . . . . . . . . . . . . . . . . 217

34 Lower Bounds 217


34.1 Sorting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
34.1.1 Decision trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
34.1.2 An easier direct argument . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
34.2 Uniqueness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
34.3 Other lower bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
34.3.1 Algebraic tree model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
34.3.2 3Sum-Hard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221

35 Backwards analysis 221


35.1 How many times can the minimum change? . . . . . . . . . . . . . . . . . . . . . . . 222
35.2 Yet another analysis of QuickSort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
35.3 Closest pair: Backward analysis in action . . . . . . . . . . . . . . . . . . . . . . . . . 223
35.3.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
35.3.2 Back to the problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
35.3.3 Slow algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
35.3.4 Linear time algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
35.4 Computing a good ordering of the vertices of a graph . . . . . . . . . . . . . . . . . . 226
35.4.1 The algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
35.4.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
35.5 Computing nets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
35.5.1 Basic definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
35.5.1.1 Metric spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
35.5.1.2 Nets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
35.5.2 Computing nets quickly for a point set in Rd . . . . . . . . . . . . . . . . . . . . 227
35.5.3 Computing an r-net in a sparse graph . . . . . . . . . . . . . . . . . . . . . . . . 228
35.5.3.1 The algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
35.5.3.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
35.6 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230

36 Linear time algorithms 230


36.1 The lowest point above a set of lines . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
36.2 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232

37 Streaming 233
37.1 How to sample a stream . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
37.2 Sampling and median selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
37.2.1 A median selection with few comparisons . . . . . . . . . . . . . . . . . . . . . . 235
37.3 Big data and the streaming model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
37.4 Heavy hitters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
37.5 Chernoff inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236

X Exercises 237

38 Exercises - Prerequisites 237


38.1 Graph Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
38.2 Recurrences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
38.3 Counting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
38.4 O notation and friends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
38.5 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
38.6 Basic data-structures and algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
38.7 General proof thingies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
38.8 Miscellaneous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247

39 Exercises - NP Completeness 248


39.1 Equivalence of optimization and decision problems . . . . . . . . . . . . . . . . . . . 248
39.2 Showing problems are NP-Complete . . . . . . . . . . . . . . . . . . . . . . . . . . 249
39.3 Solving special subcases of NP-Complete problems in polynomial time . . . . . . . 250

40 Exercises - Network Flow 258


40.1 Network Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
40.2 Min Cost Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267

41 Exercises - Miscellaneous 269


41.1 Data structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
41.2 Divide and Conqueror . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
41.3 Fast Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
41.4 Union-Find . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
41.5 Lower bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
41.6 Number theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
41.7 Sorting networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
41.8 Max Cut . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274

42 Exercises - Approximation Algorithms 275


42.1 Greedy algorithms as approximation algorithms . . . . . . . . . . . . . . . . . . . . . 275
42.2 Approximation for hard problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276

43 Randomized Algorithms 278


43.1 Randomized algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278

44 Exercises - Linear Programming 281


44.1 Miscellaneous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
44.2 Tedious . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281

45 Exercises - Computational Geometry 285


45.1 Misc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285

46 Exercises - Entropy 287

XI Homeworks/midterm/final 289

47 Fall 2001 289


47.1 Homeworks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289

47.1.1 Homework 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
47.1.2 Homework 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
47.1.3 Homework 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
47.1.4 Homework 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
47.1.5 Homework 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
47.1.6 Homework 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
47.1.7 Homework 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
47.2 Midterm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
47.3 Final . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334

48 Spring 2002 334

49 Fall 2002 334

50 Spring 2003 334


50.1 Homeworks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
50.1.1 Homework 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
50.1.2 Homework 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
50.1.3 Homework 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
50.1.4 Homework 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
50.1.5 Homework 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
50.1.6 Homework 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
50.1.7 Homework 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
50.2 Midterm 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
50.3 Midterm 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
50.4 Final . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383

51 Fall 2003 385

52 Spring 2005 385

53 Spring 2006 385

54 Fall 2007 385

55 Fall 2009 385

56 Spring 2011 385

57 Fall 2011 385

58 Fall 2012 385

59 Fall 2013 385

60 Spring 2013: CS 473: Fundamental algorithms 385


60.1 Homeworks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
60.1.1 Homework 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
60.1.2 Homework 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
60.1.3 Homework 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
60.1.4 Homework 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
60.1.5 Homework 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391

60.1.6 Homework 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
60.1.7 Homework 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394
60.1.8 Homework 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
60.1.9 Homework 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
60.1.10 Homework 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
60.1.11 Homework 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
60.2 Midterm 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
60.3 Midterm 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
60.4 Final . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410

61 Fall 2014: CS 573 – Graduate algorithms 413


61.1 Homeworks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
61.1.1 Homework 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
61.1.2 Homework 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
61.1.3 Homework 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
61.1.4 Homework 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
61.1.5 Homework 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
61.1.6 Homework 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
61.2 Midterm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
61.3 Final . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428

62 Fall 2015: CS 473 – Theory II 430


62.1 Homeworks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
62.1.1 Homework 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
62.1.2 Homework 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431
62.1.3 Homework 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
62.1.4 Homework 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
62.1.5 Homework 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
62.1.6 Homework 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
62.1.7 Homework 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
62.1.8 Homework 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
62.1.9 Homework 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
62.1.10 Homework 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
62.1.11 Homework 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444
62.1.12 Homework 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
62.2 Midterm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
62.3 Final . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451

63 Fall 2018: CS 473 Algorithms 451


63.1 Homeworks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
63.1.1 Homework 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
63.1.2 Homework 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
63.1.3 Homework 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
63.1.4 Homework 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
63.1.5 Homework 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
63.1.6 Homework 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
63.1.7 Homework 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462
63.1.8 Homework 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
63.1.9 Homework 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
63.1.10 Homework 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466

63.1.11 Homework 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
63.2 Midterm I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
63.3 Midterm II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476
63.4 Final . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479

Bibliography 491

Index 494

Part I
NP Completeness

Chapter 1

NP Completeness I

"Then you must begin a reading program immediately so that you may understand the crises of our age," Ignatius said
solemnly. "Begin with the late Romans, including Boethius, of course. Then you should dip rather extensively into
early Medieval. You may skip the Renaissance and the Enlightenment. That is mostly dangerous propaganda. Now
that I think of it, you had better skip the Romantics and the Victorians, too. For the contemporary period,
you should study some selected comic books."
"You’re fantastic."
"I recommend Batman especially, for he tends to transcend the abysmal society in which he’s found himself. His
morality is rather rigid, also. I rather respect Batman."
– A Confederacy of Dunces, John Kennedy Toole.

1.1. Introduction
The question governing this course is the development of efficient algorithms. Hopefully, the notion of an
algorithm is a well-understood concept. But what is an efficient algorithm? A natural answer (but not the only
one!) is an algorithm that runs quickly.
What do we mean by quickly? Well, we would like our algorithm to:
(A) Scale with input size. That is, it should be able to handle large and hopefully huge inputs.
(B) Low level implementation details should not matter, since they correspond to small improvements in
performance. Since faster CPUs keep appearing, it follows that such improvements would (usually) be
taken care of by hardware.
(C) What we will really care about is the asymptotic running time. Explicitly, polynomial time.
In our discussion, we will consider the input size to be n, and we would like to bound the overall running
time by a function of n which is asymptotically as small as possible. An algorithm with better asymptotic
running time would be considered to be better.

Example 1.1.1. It is illuminating to consider a concrete example. So assume we have an algorithm for a problem
that needs to perform c · 2^n operations to handle an input of size n, where c is a small constant (say 10). Let
us assume that we have a CPU that can do 10^9 operations a second. (A somewhat conservative assumption, as
currently [Jan 2006]¬, the Blue Gene supercomputer can do about 3 · 10^14 floating-point operations a second.
Since this supercomputer has about 131,072 CPUs, it is not something you would have on your desktop any
time soon.) Since 2^10 ≈ 10^3, our (cheap) computer can solve in (roughly) 10 seconds a problem
of size n = 27.
But what if we increase the problem size to n = 54? This would take our computer about 3 million years to
solve. (It is better to just wait for faster computers to show up, and then try to solve the problem. Although
there are good reasons to believe that the exponential growth in computer performance we saw in the last 40
years is about to end. Thus, unless a substantial breakthrough in computing happens, it might be that solving
problems of size, say, n = 100 for this problem would forever be outside our reach.)
The situation changes dramatically if we consider an algorithm with running time 10n^2. Then, in one second
our computer can handle input of size n = 10^4. A problem of size n = 10^8 can be solved in 10n^2/10^9 =
10^(17−9) = 10^8 seconds, which is about 3 years of computing (but Blue Gene might be able to solve it in less
than 20 minutes!).
Thus, algorithms that have asymptotically a polynomial running time (i.e., the algorithm's running time is
bounded by O(n^c), where c is a constant) are able to solve large instances of the input, and can solve the
problem even if the problem size increases dramatically.

Input size   n^2 ops   n^3 ops         n^4 ops         2^n ops        n! ops
5            0 secs    0 secs          0 secs          0 secs         0 secs
20           0 secs    0 secs          0 secs          0 secs         16 mins
30           0 secs    0 secs          0 secs          0 secs         3 · 10^9 years
50           0 secs    0 secs          0 secs          0 secs         never
60           0 secs    0 secs          0 secs          7 mins         never
70           0 secs    0 secs          0 secs          5 days         never
80           0 secs    0 secs          0 secs          15.3 years     never
90           0 secs    0 secs          0 secs          15,701 years   never
100          0 secs    0 secs          0 secs          10^7 years     never
8000         0 secs    0 secs          1 secs          never          never
16000        0 secs    0 secs          26 secs         never          never
32000        0 secs    0 secs          6 mins          never          never
64000        0 secs    0 secs          111 mins        never          never
200,000      0 secs    3 secs          7 days          never          never
2,000,000    0 secs    53 mins         202.943 years   never          never
10^8         4 secs    12.6839 years   10^9 years      never          never
10^9         6 mins    12683.9 years   10^13 years     never          never

Figure 1.1: Running time as function of input size. Algorithms with exponential running times can handle
only relatively small inputs. We assume here that the computer can do 2.5 · 10^15 operations per second, and
the functions are the exact number of operations performed. Remember – never is a long time to wait for a
computation to be completed.
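The estimates in Figure 1.1 are easy to reproduce. The following sketch (the machine speed below is the figure's assumption) computes a couple of the table entries directly:

```python
# Rough running-time estimates, under Figure 1.1's assumption of a
# machine performing 2.5 * 10^15 operations per second.
OPS_PER_SEC = 2.5e15
SECS_PER_YEAR = 3600 * 24 * 365.25

def running_time_secs(num_ops):
    """Seconds needed to execute num_ops elementary operations."""
    return num_ops / OPS_PER_SEC

# n^4 is instantaneous for n = 80, while 2^n already takes years.
print(running_time_secs(80 ** 4))                  # tiny fraction of a second
print(running_time_secs(2 ** 80) / SECS_PER_YEAR)  # about 15.3 years
```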

Can we solve all problems in polynomial time? The answer to this question is, unfortunately, no. There
are several synthetic examples of this, but it is believed that a large class of important problems cannot be
solved in polynomial time.

Circuit Satisfiability
Instance: A circuit C with m inputs
Question: Is there an input for C such that C returns TRUE for it?

¬ But the recently announced supercomputer that would be completed in 2012 in Urbana is naturally way faster. It supposedly
would do 10^15 operations a second (i.e., a petaflop). Blue Gene probably cannot sustain its theoretical speed stated above, which is
only slightly slower.

As a concrete example, consider the circuit depicted on the right (with inputs x1, . . . , x5, and gates computing
x ∧ y (And), x ∨ y (Or), and ¬x (Not)).
Currently, all solutions known to Circuit Satisfiability require checking all possibilities, requiring (roughly)
2^m time. This is exponential time, and too slow to be useful in solving large instances of the problem.
This leads us to the most important open question in theoretical computer science:

Question 1.1.2. Can one solve Circuit Satisfiability in polynomial time? And Or Not

The common belief is that Circuit Satisfiability can NOT be solved in polynomial time. Circuit Satisfiability
has two interesting properties.
(A) Given a supposed positive solution, with a detailed assignment (i.e., proof) x1 ← 0, x2 ← 1, ..., xm ← 1,
one can verify in polynomial time if this assignment really satisfies C. This is done by computing, for
every gate in the circuit, its output for this input, thus computing the output of C for its input.
This requires evaluating the gates of C in the right order, and there are some technicalities involved, which
we are ignoring. (But you should verify that you know how to write a program that does that efficiently.)
Intuitively, this is the difference in hardness between coming up with a proof (hard), and checking that a
proof is correct (easy).
(B) It is a decision problem. For a specific input, an algorithm that solves this problem has to output either
TRUE or FALSE.
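For concreteness, here is a sketch of the brute-force approach in Python. The circuit encoding (a list of gates in topological order) is our own toy representation, not anything from the notes:

```python
from itertools import product

def eval_gate(op, vals):
    """Evaluate a single And/Or/Not gate on its input values."""
    if op == "and":
        return all(vals)
    if op == "or":
        return any(vals)
    if op == "not":
        return not vals[0]
    raise ValueError(op)

def circuit_sat(inputs, gates, output):
    """Brute force: try all 2^m input assignments (exponential time).

    gates: list of (out_wire, op, in_wires), in topological order.
    Returns a satisfying assignment (dict), or None if none exists."""
    for bits in product([False, True], repeat=len(inputs)):
        wires = dict(zip(inputs, bits))
        for out, op, args in gates:
            wires[out] = eval_gate(op, [wires[a] for a in args])
        if wires[output]:
            return dict(zip(inputs, bits))
    return None

# A tiny circuit computing x1 AND (NOT x2):
gates = [("n2", "not", ["x2"]), ("y", "and", ["x1", "n2"])]
print(circuit_sat(["x1", "x2"], gates, "y"))  # {'x1': True, 'x2': False}
```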

1.2. Complexity classes


Definition 1.2.1 (P: Polynomial time). Let P denote the class of all decision problems that can be solved in
polynomial time in the size of the input.

Definition 1.2.2 (NP: Nondeterministic Polynomial time). Let NP be the class of all decision problems that can
be verified in polynomial time. Namely, for an input of size n, if the solution to the given instance is true, one
(i.e., an oracle) can provide you with a proof (of polynomial length!) that the answer is indeed TRUE for this
instance. Furthermore, you can verify this proof in polynomial time in the length of the proof.

Clearly, if a decision problem can be solved in polynomial time, then it can be verified in polynomial time.
Thus, P ⊆ NP.
Remark. The notation NP stands for Non-deterministic Polynomial.
The name comes from a formal definition of this class using Turing
machines, where the machine first guesses (i.e., the non-deterministic
stage) the proof that the instance is TRUE, and then the algorithm
verifies the proof.

Definition 1.2.3 (co-NP). The class co-NP is the opposite of NP – if
the answer is FALSE, then there exists a short proof for this negative
answer, and this proof can be verified in polynomial time.

Figure 1.2: The relation between the different complexity classes P, NP, and co-NP.

See Figure 1.2 for the currently believed relationship between these classes (of course, as mentioned above,
P ⊆ NP and P ⊆ co-NP are easy to verify). Note that it is quite possible that P = NP = co-NP, although
this would be extremely surprising.

Definition 1.2.4. A problem Π is NP-Hard, if being able to solve Π in polynomial time implies that P = NP.

Question 1.2.5. Are there any problems which are NP-Hard?

Intuitively, being NP-Hard implies that a problem is ridiculously hard. Conceptually, it would imply that
proving and verifying are equally hard – which nobody that did CS 473 believes is true.
In particular, a problem which is NP-Hard is at least as hard as ALL the problems in NP; as such, it is
safe to assume, based on overwhelming evidence, that it cannot be solved in polynomial time.
Theorem 1.2.6 (Cook’s Theorem). Circuit Satisfiability is NP-Hard.

Definition 1.2.7. A problem Π is NP-Complete (NPC in short) if it is both NP-Hard and in NP.

Clearly, Circuit Satisfiability is NP-Complete, since we can verify a positive solution in polynomial time in
the size of the circuit.

By now, thousands of problems have been shown to be NP-Complete. It is extremely unlikely that any of
them can be solved in polynomial time.

Definition 1.2.8. In the formula satisfiability problem (a.k.a. SAT) we are given a formula, for example:

(a ∨ b ∨ c ∨ d) ⇐⇒ ((b ∧ c) ∨ (a ⇒ d) ∨ (c ≠ a ∧ b)),

and the question is whether we can find an assignment to the variables a, b, c, . . . such that the formula
evaluates to TRUE.

Figure 1.3: The relation between the complexity classes (NP-Hard, co-NP, NP, P, and NP-Complete).
It seems that SAT and Circuit Satisfiability are “similar”, and as such both should be NP-Hard.
Remark 1.2.9. Cook’s theorem implies something somewhat stronger than the above statement. Specifically,
for any problem in NP, there is a polynomial time reduction to Circuit Satisfiability. Thus, the reader can
think about NPC problems as being equivalent under polynomial time reductions.

1.2.1. Reductions
Let A and B be two decision problems.
Given an input I for problem A, a reduction is a transformation of the input I into a new input I′, such that

A(I) is TRUE ⇔ B(I′) is TRUE.

Thus, one can solve A by first transforming an input I into an input I′ of B, and then solving B(I′).
This idea of using reductions is omnipresent, and used in almost any program you write.
Let T : I → I′ be the input transformation that maps A into B. How fast is T? Well, for our nefarious
purposes we need polynomial reductions; that is, reductions that take polynomial time.
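In code, using a reduction amounts to composing the input transformation with a solver for B. The following generic sketch (all names are placeholders of ours) makes this explicit:

```python
# Solving problem A via a reduction to problem B: transform the input,
# then call the solver for B. If `transform` runs in polynomial time and
# solve_B runs in polynomial time, so does the whole algorithm.
def solve_A(instance_A, transform, solve_B):
    instance_B = transform(instance_A)  # the reduction: I -> I'
    return solve_B(instance_B)          # A(I) is TRUE iff B(I') is TRUE

# Toy example: decide "is n odd?" given a solver for "is n even?".
is_even = lambda n: n % 2 == 0
is_odd = lambda n: solve_A(n, lambda m: m + 1, is_even)
```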
For example, given an instance of Circuit Satisfiability, we would like to generate an equivalent formula. We
will explicitly write down what the circuit computes in a formula form. To see how to do this, consider the
following example.
[Figure: an example circuit with inputs x1, . . . , x5, internal wires y1, . . . , y8, and output wire y8.]

We introduced a variable for each wire in the circuit, and we wrote down explicitly what each gate computes:

y1 = x1 ∧ x4    y2 = ¬x4    y3 = y2 ∧ x3
y4 = x2 ∨ y1    y5 = ¬x2    y6 = ¬x5
y7 = y3 ∨ y5    y8 = y4 ∧ y7 ∧ y6

Namely, we wrote a formula for each gate, which holds only if the gate correctly computes the output for its
given input.
The circuit is satisfiable if and only if there is an assignment such that all the above formulas hold.
Alternatively, the circuit is satisfiable if and only if the following (single) formula is satisfiable:

(y1 = x1 ∧ x4) ∧ (y2 = ¬x4) ∧ (y3 = y2 ∧ x3)
∧ (y4 = x2 ∨ y1) ∧ (y5 = ¬x2)
∧ (y6 = ¬x5) ∧ (y7 = y3 ∨ y5)
∧ (y8 = y4 ∧ y7 ∧ y6) ∧ y8.

It is easy to verify that this transformation can be done in polynomial time.
The resulting reduction is depicted in Figure 1.4:

    Input: boolean circuit C
        ⇓  O(size of C)
    transform C into boolean formula F
        ⇓
    Find a SAT assignment for F using a SAT solver
        ⇓
    Return TRUE if F is satisfiable, otherwise FALSE.

Figure 1.4: Algorithm for solving CSAT using an algorithm that solves the SAT problem.

Namely, given a solver for SAT that runs in T_SAT(n), we can solve the CSAT problem in time

T_CSAT(n) ≤ O(n) + T_SAT(O(n)),

where n is the size of the input circuit. Namely, if we have a polynomial time algorithm that solves SAT, then
we can solve CSAT in polynomial time.
Another way of looking at it: assume we believe that solving CSAT requires exponential time, namely
T_CSAT(n) ≥ 2^n. The above reduction then implies that

2^n ≤ T_CSAT(n) ≤ O(n) + T_SAT(O(n)).

Namely, T_SAT(n) ≥ 2^(n/c) − O(n), where c is some positive constant. That is, if we believe that we need
exponential time to solve CSAT, then we need exponential time to solve SAT.
This implies that if SAT ∈ P then CSAT ∈ P.
We just proved that SAT is as hard as CSAT. Clearly, SAT ∈ NP, which implies the following theorem.
Theorem 1.2.10. SAT (formula satisfiability) is NP-Complete.
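The gate-by-gate translation of a circuit into a formula can be sketched as follows (reusing a toy gate encoding; the string output format is ours, for illustration only):

```python
# For each gate, write an equation stating that it computes its output
# correctly; the conjunction of these equations, together with the
# requirement that the output wire is TRUE, is the formula F. This runs
# in time linear in the size of the circuit.
def circuit_to_formula(gates, output):
    """gates: list of (out_wire, op, in_wires), op in {'and', 'or', 'not'}."""
    equations = []
    for out, op, args in gates:
        if op == "not":
            equations.append(f"{out} = not {args[0]}")
        else:
            equations.append(f"{out} = " + f" {op} ".join(args))
    equations.append(output)  # force the circuit's output to be TRUE
    return equations
```

For instance, running it on the first two gates of the example above yields the equations `y1 = x1 and x4` and `y2 = not x4`, followed by the output requirement.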

1.3. More NP-Complete problems


1.3.1. 3SAT
A boolean formula is in conjunctive normal form (CNF) if it is a conjunction (AND) of several clauses, where a
clause is the disjunction (OR) of several literals, and a literal is either a variable or a negation of a variable.
For example, the following is a CNF formula:

(a ∨ b ∨ c) ∧ (a ∨ e) ∧ (c ∨ e),

where its first subformula, (a ∨ b ∨ c), is a clause.

Definition 1.3.1. A 3CNF formula is a CNF formula with exactly three literals in each clause.

The problem 3SAT is formula satisfiability when the formula is restricted to be a 3CNF formula.
Theorem 1.3.2. 3SAT is NP-Complete.

Proof: First, it is easy to verify that 3SAT is in NP.
Next, we show that 3SAT is NP-Hard by a reduction from CSAT (i.e., Circuit Satisfiability). As such,
our input is a circuit C of size n. We will transform it into a 3CNF formula in several steps:

(A) Make sure every AND/OR gate has only two inputs. If (say) an AND gate has more inputs, we replace
it by a cascaded tree of AND gates, each one of degree two.
(B) Write down the circuit as a formula by traversing the circuit, as was done for SAT. Let F be the resulting
formula.
A clause corresponding to a gate in F will be of the following forms: (i) a = b ∧ c if it corresponds to an
AND gate, (ii) a = b ∨ c if it corresponds to an OR gate, and (iii) a = ¬b if it corresponds to a NOT gate.
Notice that, except for the single clause corresponding to the output of the circuit, all clauses are of this
form. The clause that corresponds to the output is a single variable.
(C) Change every gate clause into several CNF clauses.
(i) For example, an AND gate clause of the form a = b ∧ c will be translated into
 
a ∨ b ∨ c ∧ (a ∨ b) ∧ (a ∨ c). (1.1)

Note that Eq. (1.1) is true if and only if a = b ∧ c is true. Namely, we can replace the clause a = b ∧ c
in F by Eq. (1.1).
(ii) Similarly, an OR gate clause the form a = b ∨ c in F will be transformed into

(a ∨ b ∨ c) ∧ (a ∨ b) ∧ (a ∨ c).

(iii) Finally, a clause a = b, corresponding to a NOT gate, will be transformed into

(a ∨ b) ∧ (a ∨ b).

(D) Make sure every clause is exactly three literals. Thus, a single variable clause a would be replaced by

(a ∨ x ∨ y) ∧ (a ∨ x ∨ y) ∧ (a ∨ x ∨ y) ∧ (a ∨ x ∨ y),

by introducing two new dummy variables x and y. And a two variable clause a ∨ b would be replaced by

(a ∨ b ∨ y) ∧ (a ∨ b ∨ y),

by introducing the dummy variable y.

This completes the reduction, and results in a new 3CNF formula G which is satisfiable if and only if the original
circuit C is satisfiable. The reduction is depicted in Figure 1.5. Namely, we generated a 3CNF formula
equivalent to the original circuit. We conclude that if T_3SAT(n) is the time required to solve 3SAT, then

T_CSAT(n) ≤ O(n) + T_3SAT(O(n)),

which implies that if we have a polynomial time algorithm for 3SAT, we can solve CSAT in polynomial time.
Namely, 3SAT is NP-Complete.

    Input: boolean circuit
        ⇓  O(n)
    3CNF formula
        ⇓
    Decide if the given formula is satisfiable using a 3SAT solver
        ⇓
    Return TRUE or FALSE

Figure 1.5: Reduction from CSAT to 3SAT
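Steps (C) and (D) of the proof can be written down directly. In the sketch below (our own encoding) a literal is a (variable, negated) pair and a clause is a list of literals:

```python
import itertools

# Step (C): the CNF clauses for a gate equation a = b AND c, as in
# Eq. (1.1): (a | ~b | ~c) & (~a | b) & (~a | c).
def and_gate_clauses(a, b, c):
    return [[(a, 0), (b, 1), (c, 1)], [(a, 1), (b, 0)], [(a, 1), (c, 0)]]

# Step (D): pad 1- and 2-literal clauses to exactly three literals,
# using fresh dummy variables.
def pad_to_three(clause, fresh):
    if len(clause) == 3:
        return [clause]
    if len(clause) == 2:
        y = fresh()
        return [clause + [(y, 0)], clause + [(y, 1)]]
    (lit,) = clause
    x, y = fresh(), fresh()
    return [[lit, (x, s), (y, t)] for s in (0, 1) for t in (0, 1)]

counter = itertools.count()
fresh = lambda: f"dummy{next(counter)}"
```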

1.4. Bibliographical Notes
Cook’s theorem was proved by Stephen Cook (http://en.wikipedia.org/wiki/Stephen_Cook). It was proved
independently by Leonid Levin (http://en.wikipedia.org/wiki/Leonid_Levin) at more or less the same
time. Thus, this theorem should be referred to as the Cook-Levin theorem.
The standard text on this topic is [GJ90]. Another useful book is [ACG+99], which is more recent and
more up to date, and contains more advanced material.

Chapter 2

NP Completeness II

2.1. Max-Clique

We remind the reader that a clique is a complete graph, where every pair of vertices is connected by an
edge. The MaxClique problem asks what is the largest clique appearing as a subgraph of G. See Figure 2.1.

MaxClique
Instance: A graph G.
Question: What is the largest number of nodes in G forming a complete subgraph?

Figure 2.1: A clique of size 4 inside a graph with 8 vertices.

Note that MaxClique is an optimization problem, since the output of the algorithm is a number and not
just TRUE/FALSE.
The first natural question is how to solve MaxClique. A naive algorithm would work by enumerating all
subsets S ⊆ V(G), checking for each such subset S if it induces a clique in G (i.e., all pairs of vertices in S
are connected by an edge of G). If so, we know that G_S is a clique, where G_S denotes the induced subgraph
on S defined by G; that is, the graph formed by removing all the vertices that are not in S from G (in
particular, only edges that have both endpoints in S appear in G_S). Finally, our algorithm would return the
largest S encountered such that G_S is a clique. The running time of this algorithm is O(2^n n^2), as
can be easily verified.
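The naive algorithm just described can be implemented directly; the following sketch returns a maximum clique by trying subsets from largest to smallest:

```python
from itertools import combinations

# Enumerate vertex subsets from largest to smallest and return the first
# one all of whose vertex pairs are edges -- i.e., a maximum clique.
# There are 2^n subsets and each check costs O(n^2), giving O(2^n n^2).
def max_clique(vertices, edges):
    edge_set = {frozenset(e) for e in edges}
    for k in range(len(vertices), 0, -1):
        for S in combinations(vertices, k):
            if all(frozenset(p) in edge_set for p in combinations(S, 2)):
                return list(S)
    return []

# A triangle {1, 2, 3} with a pendant vertex 4:
print(max_clique([1, 2, 3, 4], [(1, 2), (2, 3), (1, 3), (3, 4)]))  # [1, 2, 3]
```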


Suggestion 2.1.1. When solving any algorithmic problem, always try first to find a simple (or even naive)
solution. You can try optimizing it later, but even a naive solution might give you useful insight into a problem's
structure and behavior.
We will prove that MaxClique is NP-Hard. Before dwelling into that, note that the simple algorithm we devised for
MaxClique sheds some light on why, intuitively, it should be NP-Hard: it does not seem like there is any way of
avoiding the brute force enumeration of all possible subsets of the vertices of G. Thus, a problem is NP-Hard
or NP-Complete, intuitively, if the only way we know how to solve the problem is to use naive brute force
enumeration of all relevant possibilities.

How to prove that a problem X is NP-Hard? Proving that a given problem X is NP-Hard is usually
done in two steps. First, we pick a known NP-Complete problem A. Next, we show how to solve any instance
of A in polynomial time, assuming that we are given a polynomial time algorithm that solves X.
Proving that a problem X is NP-Complete requires the additional burden of showing that it is in NP. Note that only decision problems can be NP-Complete, while optimization problems can be NP-Hard; namely, the set of NP-Hard problems is much bigger than the set of NP-Complete problems.
Theorem 2.1.2. MaxClique is NP-Hard.

Proof: We show a reduction from 3SAT. So, consider an input to 3SAT, which is a formula F defined over n variables (and with m clauses).
We build a graph from the formula F by scanning it, as follows:
(i) For every literal in the formula we generate a vertex, and label the vertex with the literal it corresponds to. Note that every clause corresponds to three such vertices.
(ii) We connect two vertices in the graph if they are: (i) in different clauses, and (ii) not a negation of each other.
Let G denote the resulting graph. See Figure 2.2 for a concrete example. Note that this reduction can easily be done in quadratic time in the size of the given formula.

Figure 2.2: The generated graph for the formula (a ∨ b ∨ c) ∧ (b ∨ c ∨ d) ∧ (a ∨ c ∨ d) ∧ (a ∨ b ∨ d).
We claim that F is satisfiable iff there exists a clique of size m in G.
=⇒ Let x1, . . . , xn be the variables appearing in F, and let v1, . . . , vn ∈ {0, 1} be a satisfying assignment for F. Namely, the formula F holds if we set xi = vi, for i = 1, . . . , n.
For every clause C in F there must be at least one literal that evaluates to TRUE. Pick a vertex that corresponds to such a TRUE value from each clause. Let W be the resulting set of vertices. Clearly, W forms a clique in G. The set W is of size m, since there are m clauses, and each one contributes one vertex to the clique.
⇐= Let U be the set of m vertices which form a clique in G. We need to translate the clique GU into a satisfying assignment of F:
(i) Set xi ← TRUE if there is a vertex in U labeled with xi.
(ii) Set xi ← FALSE if there is a vertex in U labeled with x̄i.
This is a valid assignment, as can be easily verified. Indeed, assume for the sake of contradiction that there is a variable xi such that there are two vertices u, v in U labeled with xi and x̄i; namely, we are trying to assign two contradictory values to xi. But then u and v, by construction, are not connected in G, and as such GU is not a clique. A contradiction.
Furthermore, this is a satisfying assignment, as there is at least one vertex of U in each clause. This implies that there is a literal evaluating to TRUE in each clause. Namely, F evaluates to TRUE.
Thus, given a polytime (i.e., polynomial time) algorithm for MaxClique, we can solve 3SAT in polytime. We conclude that MaxClique is NP-Hard.
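The graph construction of the proof is mechanical. The following Python sketch builds G from a formula; the encoding of a literal as a (variable, sign) pair is our own convention, not the text's:

```python
from itertools import combinations

def formula_to_graph(clauses):
    """Build the graph G of Theorem 2.1.2 from a 3CNF formula.

    clauses: list of clauses, each a tuple of literals; a literal is
    (var, sign) with sign True for positive, False for negated.
    Returns (vertices, edges); a vertex is a pair (clause_index, literal)."""
    vertices = [(i, lit) for i, clause in enumerate(clauses) for lit in clause]
    edges = set()
    for (i, (v1, s1)), (j, (v2, s2)) in combinations(vertices, 2):
        # connect iff: in different clauses, and not negations of each other
        if i != j and not (v1 == v2 and s1 != s2):
            edges.add(frozenset([(i, (v1, s1)), (j, (v2, s2))]))
    return vertices, edges

# (a OR b OR not-c) AND (not-a OR b OR c): satisfiable, so a clique of size m = 2 exists.
F = [(('a', True), ('b', True), ('c', False)),
     (('a', False), ('b', True), ('c', True))]
V, E = formula_to_graph(F)
```

Picking the vertex for b in each clause gives a clique of size m = 2, matching the satisfying assignment b = TRUE.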

MaxClique is an optimization problem, but it can be easily restated as a decision problem.


Clique
Instance: A graph G, integer k
Question: Is there a clique in G of size k?

Figure 2.3: (a) A clique in a graph G, (b) the complement graph, formed by all the edges not appearing in G, and (c) the complement graph and the independent set corresponding to the clique in G.

Theorem 2.1.3. Clique is NP-Complete.

Proof: It is NP-Hard by the reduction of Theorem 2.1.2. Thus, we only need to show that it is in NP. This is quite easy. Indeed, given a graph G having n vertices, a parameter k, and a set W of k vertices, verifying that every pair of vertices in W forms an edge in G takes O(u + k²) time, where u is the size of the input (i.e., number of edges + number of vertices). Namely, verifying a positive answer to an instance of Clique can be done in polynomial time.
Thus, Clique is NP-Complete.
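The certificate check above is a few lines of code. A small sketch (our own Python, with the graph given as an edge set):

```python
from itertools import combinations

def verify_clique_certificate(edges, k, W):
    """Check in polynomial time that W is a clique of size k in G.

    edges: set of frozensets {u, v}; W: the certificate, a set of vertices."""
    if len(W) != k:
        return False
    # every pair of certificate vertices must be an edge of G
    return all(frozenset((u, v)) in edges for u, v in combinations(W, 2))

E = {frozenset(p) for p in [(1, 2), (1, 3), (2, 3), (3, 4)]}
print(verify_clique_certificate(E, 3, {1, 2, 3}))  # True
print(verify_clique_certificate(E, 3, {2, 3, 4}))  # False: 2 and 4 not adjacent
```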

2.2. Independent Set


Definition 2.2.1. A set S of nodes in a graph G = (V, E) is an independent set if no pair of vertices in S is connected by an edge.

Independent Set
Instance: A graph G, integer k
Question: Is there an independent set in G of size k?

Theorem 2.2.2. Independent Set is NP-Complete.

Proof: This readily follows by a reduction from Clique. Given G and k, compute the complement graph Ḡ, where we connect two vertices u, v in Ḡ iff they are independent (i.e., not connected) in G. See Figure 2.3. Clearly, a clique in G corresponds to an independent set in Ḡ, and vice versa. Thus, Independent Set is NP-Hard, and since it is in NP, it is NP-Complete.
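The reduction is just graph complementation. Here is a sketch in Python; the brute-force clique routine stands in for the hypothetical Clique black box, purely for illustration:

```python
from itertools import combinations

def complement(vertices, edges):
    """Edge set of the complement graph: u, v adjacent iff not adjacent in G."""
    return {frozenset((u, v)) for u, v in combinations(vertices, 2)
            if frozenset((u, v)) not in edges}

def brute_clique(vertices, edges, k):
    """Stand-in for the Clique black box (brute force, tiny inputs only)."""
    return any(all(frozenset((u, v)) in edges for u, v in combinations(S, 2))
               for S in combinations(vertices, k))

def has_independent_set(vertices, edges, k):
    # Independent Set reduces to Clique on the complement graph
    return brute_clique(vertices, complement(vertices, edges), k)

V = [1, 2, 3]                                   # the path 1-2-3
E = {frozenset((1, 2)), frozenset((2, 3))}
print(has_independent_set(V, E, 2))  # True: {1, 3}
print(has_independent_set(V, E, 3))  # False
```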

2.3. Vertex Cover


Definition 2.3.1. For a graph G, a set of vertices S ⊆ V(G) is a vertex cover if it touches every edge of G.
Namely, for every edge uv ∈ E(G) at least one of the endpoints is in S.

Vertex Cover
Instance: A graph G, integer k
Question: Is there a vertex cover in G of size k?

Lemma 2.3.2. A set S is a vertex cover in G iff V \ S is an independent set in G.

Proof: If S is a vertex cover, then consider two vertices u, v ∈ V \ S. If uv ∈ E(G), then the edge uv is not covered by S. A contradiction. Thus V \ S is an independent set in G.
Similarly, if V \ S is an independent set in G, then for any edge uv ∈ E(G) it must be that u or v is not in V \ S. Namely, S covers all the edges of G.
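Lemma 2.3.2 is easy to confirm mechanically. The sketch below (our own Python, on a small hypothetical graph) exhaustively checks the equivalence over all vertex subsets:

```python
from itertools import combinations

def is_vertex_cover(edges, S):
    # every edge must have at least one endpoint in S
    return all(e & S for e in edges)

def is_independent(edges, T):
    # no edge may have both endpoints in T
    return not any(e <= T for e in edges)

# Exhaustively confirm the lemma on a small example graph (a path 1-2-3-4).
V = frozenset({1, 2, 3, 4})
E = [frozenset((1, 2)), frozenset((2, 3)), frozenset((3, 4))]
ok = all(is_vertex_cover(E, frozenset(S)) == is_independent(E, V - frozenset(S))
         for r in range(len(V) + 1) for S in combinations(V, r))
print(ok)  # True
```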

Theorem 2.3.3. Vertex Cover is NP-Complete.

Proof: Vertex Cover is in NP, as can be easily verified. To show that it is NP-Hard, we do a reduction from Independent Set. So, we are given an instance of Independent Set, which is a graph G and a parameter k, and we want to know whether there is an independent set in G of size k. By Lemma 2.3.2, G has an independent set of size k iff it has a vertex cover of size n − k. Thus, feeding G and n − k into (the supposedly given) black box that solves Vertex Cover in polynomial time, we can decide whether G has an independent set of size k in polynomial time. Thus, Vertex Cover is NP-Complete.

2.4. Graph Coloring


Definition 2.4.1. A coloring, by c colors, of a graph G = (V, E) is a mapping C : V(G) → {1, 2, . . . , c}, such that every vertex is assigned a color (i.e., an integer), and no two vertices that share an edge are assigned the same color.

Usually, we would like to color a graph with a minimum number of colors. Deciding if a graph can be colored with two colors is equivalent to deciding if a graph is bipartite, and can be done in linear time using DFS or BFS¬.
Coloring is useful for resource allocation (used in compilers, for example) and scheduling-type problems. Surprisingly, moving from two colors to three colors makes the problem much harder.
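As a reminder, the linear-time two-coloring (bipartiteness) test runs a BFS and assigns alternating colors. A minimal sketch (our own Python, with an adjacency-list representation):

```python
from collections import deque

def two_color(adj):
    """Try to 2-color a graph given as {vertex: [neighbors]}.

    Returns a color map (0/1) if the graph is bipartite, else None.
    Linear time: each vertex and edge is examined O(1) times."""
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]   # give v the opposite color
                    queue.append(v)
                elif color[v] == color[u]:    # odd cycle: not 2-colorable
                    return None
    return color

even_cycle = {1: [2, 4], 2: [1, 3], 3: [2, 4], 4: [3, 1]}
triangle = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
print(two_color(even_cycle) is not None)  # True
print(two_color(triangle))                # None
```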

3Colorable
Instance: A graph G.
Question: Is there a coloring of G using three colors?

Theorem 2.4.2. 3Colorable is NP-Complete.

Proof: Clearly, 3Colorable is in NP.
We prove that it is NP-Complete by a reduction from 3SAT. Let F be the given 3SAT instance. The basic idea of the proof is to use gadgets to transform the formula into a graph. Intuitively, a gadget is a small component that corresponds to some feature of the input.
The first gadget is the color generating gadget, which is formed by three special vertices connected to each other, denoted by X, F and T, respectively. We will consider the color used to color T to correspond to the TRUE value, and the color of F to correspond to the FALSE value.
¬ If you do not know the algorithm for this, please read about it to fill this monstrous gap in your knowledge.

Figure 2.4: The clause a ∨ b ∨ c and the possible colorings of its literals, cases (1)-(8). If all three literals are colored by the color of the special node F, then there is no valid coloring of this component; see case (1).

For every variable y appearing in F, we generate a variable gadget, which is (again) a triangle, consisting of two new vertices, denoted by y and ȳ, and the auxiliary vertex X from the color generating gadget. Note that in a valid 3-coloring of the resulting graph, either y is colored by T (i.e., it is assigned the same color as the vertex T) and ȳ is colored by F, or the other way around. Thus, a valid coloring can be interpreted as assigning a TRUE or FALSE value to each variable y, by just inspecting the color used for coloring the vertex y.
Finally, for every clause we introduce a clause gadget; see Figure 2.4 for how the gadget looks for the clause a ∨ b ∨ c. The vertices marked by a, b and c are the corresponding vertices from the corresponding variable gadgets. We introduce five new vertices (u, v, w, r and s) for every such gadget. The claim is that this gadget can be colored by three colors if and only if the clause is satisfied. This can be checked by brute force over all 8 possibilities; we demonstrate it only for two cases, and the reader should verify that it works for the other cases as well.
Indeed, if all three vertices (i.e., the three literals of the clause) on the left side of a clause gadget are assigned the F color (in a valid coloring of the resulting graph), then the vertices u and v must be assigned X and T, or T and X, respectively, in any valid 3-coloring of this gadget. As such, the vertex w must be assigned the color F, and then the vertex r must be assigned the X color. But then the vertex s has three neighbors with three different colors, and there is no valid coloring for s.

Figure 2.5: The formula (a ∨ b ∨ c) ∧ (b ∨ c ∨ d) ∧ (a ∨ c ∨ d) ∧ (a ∨ b ∨ d) reduces to the depicted graph.

As another example, consider the case when one of the literals on the left is assigned the T color. Then the clause gadget can be colored in a valid way, as demonstrated in Figure 2.4.
This concludes the reduction. Clearly, the generated graph can be computed in polynomial time. By the above argumentation, if there is a valid 3-coloring of the resulting graph G, then there is a satisfying assignment for F. Similarly, if there is a satisfying assignment for F, then G can be colored in a valid way using three colors. For how the resulting graph looks, see Figure 2.5.
This implies that 3Colorable is NP-Complete.

Here is an interesting related problem. You are given a graph G as input, and you know that it is 3-colorable. In polynomial time, what is the minimum number of colors you can use to color this graph legally? Currently, the best polynomial time algorithm for coloring such graphs uses O(n^{3/14}) colors.

Chapter 3

NP Completeness III

3.1. Hamiltonian Cycle


Definition 3.1.1. A Hamiltonian cycle is a cycle in the graph that visits every vertex exactly once.

Definition 3.1.2. An Eulerian cycle is a cycle in a graph that uses every edge exactly once.

Finding an Eulerian cycle can be done in linear time. Surprisingly, finding a Hamiltonian cycle is much harder.

Hamiltonian Cycle
Instance: A graph G.
Question: Is there a Hamiltonian cycle in G?

Theorem 3.1.3. Hamiltonian Cycle is NP-Complete.

Proof: Hamiltonian Cycle is clearly in NP.
We will show a reduction from Vertex Cover. Given a graph G and an integer k, we redraw G in the following way: we turn every vertex into a horizontal line segment, all of the same length. Next, we turn every edge in the original graph G into a gate, which is a vertical segment connecting the two relevant vertices.
Note that there is a vertex cover of size k in G if and only if there are k horizontal lines that stab all the gates in the resulting drawing H (a line stabs a gate if one of the endpoints of the gate lies on the line).
Thus, computing a vertex cover in G is equivalent to computing k disjoint paths through the graph G that visit all the gates. However, there is a technical problem: a path might change venues or even go back.
To overcome this problem, we replace each gate with a component that guarantees that if you visit all of its vertices, you have to go forward and can NOT go back (or change "lanes"). There are only three possible ways to visit all the vertices of the component by paths that do not start/end inside the component.
The proof that these are the only three possibilities is by brute force. For example, a path that tries to backtrack by entering on the top and leaving on the bottom leaves some vertices unvisited, which means that not all the vertices in the graph get visited, because we add the constraint that the paths start/end outside the gate component (this condition will be enforced naturally by our final construction).
The resulting graph H1 for the example graph we started with has the following property: there exists a vertex cover of size k in the original graph iff there exist k paths in H1 that start on the left side, end on the right side, and together visit all the vertices.
The final stroke is to add connections from the left side to the right side, such that once you arrive at the right side, you can go back to the left side. However, we want connections that allow you to travel exactly k times. This is done by adding to the above graph a "routing box" component H2, with k new middle vertices. The ith vertex on the left of the routing component is the leftmost vertex of the ith horizontal line in the graph, and the ith vertex on the right of the component is the rightmost vertex of the ith horizontal line in the graph.
It is now easy (but tedious) to verify that the resulting graph H1 ∪ H2 has a Hamiltonian cycle iff H1 has k paths going from left to right, which happens iff the original graph has a vertex cover of size k. It is easy to verify that this reduction can be done in polynomial time.

3.2. Traveling Salesman Problem
A traveling salesman tour is a Hamiltonian cycle in a graph, whose cost is the total cost of all the edges it uses.
TSP
Instance: G = (V, E): a complete graph on n vertices, c(e): an integer cost function over the edges of G, and k: an integer.
Question: Is there a traveling-salesman tour with cost at most k?

Theorem 3.2.1. TSP is NP-Complete.


Proof: Reduction from Hamiltonian Cycle. Consider a graph G = (V, E), and let H be the complete graph defined over V. Let

    c(e) = 1 if e ∈ E(G), and c(e) = 2 if e ∉ E(G).

Set k = n. Clearly, the cheapest traveling-salesman tour in H has cost n iff G is Hamiltonian. Indeed, if G is not Hamiltonian, then any tour must use at least one edge that does not belong to G, and then its price is at least n + 1.
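The cost assignment above is a one-liner. In the Python sketch below, a brute-force tour search stands in for the hypothetical TSP solver, just to make the reduction observable on tiny inputs:

```python
from itertools import permutations

def ham_to_tsp(vertices, edges):
    """Edge costs of Theorem 3.2.1: 1 for edges of G, 2 otherwise."""
    return {frozenset((u, v)): (1 if frozenset((u, v)) in edges else 2)
            for i, u in enumerate(vertices) for v in vertices[i + 1:]}

def cheapest_tour(vertices, cost):
    """Brute-force TSP; only illustrates the reduction on tiny inputs."""
    first, rest = vertices[0], vertices[1:]
    return min(sum(cost[frozenset((t[i], t[(i + 1) % len(t)]))]
                   for i in range(len(t)))
               for p in permutations(rest) for t in [(first,) + p])

V = [1, 2, 3, 4]
E = {frozenset(p) for p in [(1, 2), (2, 3), (3, 4), (4, 1)]}  # a 4-cycle
cost = ham_to_tsp(V, E)
print(cheapest_tour(V, cost))  # 4: G is Hamiltonian, so the best tour costs n
```

Removing one cycle edge from E makes the cheapest tour cost n + 1, as the proof argues.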

3.3. Subset Sum


We would like to prove that the following problem, Subset Sum, is NPC.
Subset Sum
Instance: S: a set of positive integers, t: an integer number (the target).
Question: Is there a subset X ⊆ S such that ∑_{x∈X} x = t?

How does one prove that a problem is NP-Complete? First, one has to choose an appropriate NPC problem to reduce from. In this case, we will use 3SAT. Namely, we are given a 3CNF formula with n variables and m clauses. The second stage is to "play" with the problem, understand what kind of constraints can be encoded in an instance of the given problem, and understand the general structure of the problem.
The first observation is that we can use very long numbers as input to Subset Sum. The numbers can be of polynomial length in the size of the input 3SAT formula F.
The second observation is that, instead of thinking about Subset Sum as adding numbers, we can think about it as a problem where we are given vectors with k components each, and the sum of the chosen vectors (coordinate by coordinate) must match a target vector. For example, the input might be the vectors (1, 2), (3, 4), (5, 6) and the target vector might be (6, 8). Clearly, (1, 2) + (5, 6) gives the required target vector. Let us refer to this new problem as Vec Subset Sum.
Vec Subset Sum
Instance: S: a set of n vectors of dimension k, each vector having non-negative coordinates, and a target vector t.
Question: Is there a subset X ⊆ S such that ∑_{v∈X} v = t?

Given an instance of Vec Subset Sum, we can convert it into an instance of Subset Sum as follows: We compute the largest number in the given instance, multiply it by n² · k · 100, and compute how many digits are required to write this number down. Let U be this number of digits. Now, we take every vector in the given instance and write it down using U digits per coordinate, padding with zeroes as necessary. Clearly, each vector is now converted into a huge integer number. The key property is that a sum of numbers in a specific column of the given instance cannot spill into the digits allocated for a different column, since there are enough zeroes separating the digits corresponding to two different columns.
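The padding argument can be made concrete. A Python sketch (the width formula follows the text's choice of multiplying the largest entry by n² · k · 100):

```python
def vec_to_number(vec, width):
    """Concatenate the coordinates of vec, each padded to `width` digits."""
    return int(''.join(str(x).zfill(width) for x in vec))

def vec_subset_sum_to_subset_sum(vectors, target):
    """Convert a Vec Subset Sum instance into a Subset Sum instance.

    The digit width is chosen large enough that column sums can never
    carry over into the digits of the next column."""
    n, k = len(vectors), len(target)
    largest = max(max(v) for v in vectors + [target])
    width = len(str(largest * n * n * k * 100))
    return ([vec_to_number(v, width) for v in vectors],
            vec_to_number(target, width))

# The example from the text: (1, 2) + (5, 6) = (6, 8).
nums, t = vec_subset_sum_to_subset_sum([(1, 2), (3, 4), (5, 6)], (6, 8))
print(nums[0] + nums[2] == t)  # True
```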

Next, let us observe that we can force the solution (if it exists) for Vec Subset Sum to include exactly one vector out of a given pair of vectors. To this end, we introduce a new coordinate for all the vectors. The two vectors a1 and a2 have 1 in this coordinate, and all other vectors have zero in this coordinate. Finally, we set this coordinate in the target vector to 1. Clearly, a solution is a subset of vectors that add up to 1 in this coordinate. Namely, we have to choose either a1 or a2 into our solution.
In particular, for each variable x appearing in F, we introduce two rows, denoted by x and x̄, and use the above mechanism to force choosing either x or x̄ into the solution. If x (resp. x̄) is chosen into the solution, we interpret it as the solution to F assigning TRUE (resp. FALSE) to x.
Next, consider a clause C ≡ a ∨ b ∨ c appearing in F. This clause requires that we choose at least one row from the rows corresponding to a, b and c. This can be enforced by introducing a new coordinate for the clause C, and setting it to 1 in each row whose choice satisfies the clause. The question now is what to set the target to, since a valid solution might have any number between 1 and 3 as the sum of this coordinate. To overcome this, we introduce three new dummy rows that store in this coordinate the numbers 7, 8 and 9, respectively, and we set this coordinate in the target to 10. Clearly, if we pick two dummy rows into the solution, then the sum in this coordinate exceeds 10. Similarly, if we do not pick one of these three dummy rows, the maximum sum in this coordinate is 1 + 1 + 1 = 3, which is smaller than 10. Thus, the only possibility is to pick exactly one dummy row, and some subset of the literal rows, such that the total sum in this coordinate is 10. Notice that this "gadget" can accommodate any (non-empty) subset of the three rows chosen for a, b and c.
We repeat this process for each clause of F. We end up with a set U of 2n + 3m vectors with n + m coordinates, and the question is whether there is a subset of these vectors that adds up to the target vector. There is such a subset if and only if the original formula F is satisfiable, as can be easily verified. Furthermore, this reduction can be done in polynomial time.
Finally, we convert these vectors into an instance of Subset Sum. Clearly, this instance of Subset Sum has a solution if and only if the original instance of 3SAT had a solution. Since Subset Sum is in NP, as can be easily verified, we conclude that Subset Sum is NP-Complete.

Theorem 3.3.1. Subset Sum is NP-Complete.

For a concrete example of the reduction, see Figure 3.1.

3.4. 3 dimensional Matching (3DM)

3DM
Instance: X, Y, Z: sets of n elements each, and T a set of triples, such that (a, b, c) ∈ T ⊆ X × Y × Z.
Question: Is there a subset S ⊆ T of n disjoint triples, s.t. every element of X ∪ Y ∪ Z is covered exactly once?

Theorem 3.4.1. 3DM is NP-Complete.

The proof is long and tedious and is omitted.


BTW, 2DM is polynomial (later in the course?).

Figure 3.1: The Vec Subset Sum instance generated for the 3SAT formula F = (b ∨ c ∨ d) ∧ (a ∨ b ∨ c) is shown on the left. On the right side is the resulting instance of Subset Sum.

3.5. Partition

Partition
Instance: A set S of n numbers.
Question: Is there a subset T ⊆ S s.t. ∑_{t∈T} t = ∑_{s∈S\T} s?

Theorem 3.5.1. Partition is NP-Complete.

Proof: Partition is in NP, as we can easily verify that a given partition is valid.
We reduce from Subset Sum. Let the given instance be n numbers a1, . . . , an and a target number t. Let S = ∑_{i=1}^{n} a_i, and set a_{n+1} = 3S − t and a_{n+2} = 3S − (S − t) = 2S + t. It is easy to verify that there is a solution to the given instance of Subset Sum iff there is a solution to the following instance of Partition: a1, . . . , an, a_{n+1}, a_{n+2}.
Thus, Partition is NP-Complete.
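The reduction is short enough to state in code. A sketch (our own Python, with brute-force checkers for both problems, to confirm the equivalence on a tiny instance):

```python
from itertools import combinations

def subset_sum_to_partition(nums, t):
    """Theorem 3.5.1: append 3S - t and 2S + t, where S = sum(nums)."""
    S = sum(nums)
    return nums + [3 * S - t, 2 * S + t]

def brute_subset_sum(nums, t):
    return any(sum(c) == t for r in range(len(nums) + 1)
               for c in combinations(nums, r))

def brute_partition(nums):
    # a perfect partition exists iff some subset hits half the (even) total
    return sum(nums) % 2 == 0 and brute_subset_sum(nums, sum(nums) // 2)

a, t = [1, 2, 5], 3
print(brute_subset_sum(a, t))                          # True: 1 + 2 = 3
print(brute_partition(subset_sum_to_partition(a, t)))  # True as well
```

Note that the two appended numbers can never land on the same side, since (3S − t) + (2S + t) = 5S exceeds half of the new total, 3S.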

3.6. Some other problems


It is not hard to show that the following problems are NP-Complete:

SET COVER
Instance: (S, F, k):
S: a set of n elements.
F: a family of subsets of S, s.t. ∪_{X∈F} X = S.
k: a positive integer.
Question: Are there k sets S1, . . . , Sk ∈ F that cover S? Formally, is ∪_i Si = S?

Part II
Dynamic programming

Chapter 4

Dynamic programming

The events of 8 September prompted Foch to draft the later legendary signal: "My centre is giving way, my right is in retreat, situation excellent. I attack." It was probably never sent.
– The first world war, John Keegan.

4.1. Basic Idea - Partition Number


Definition 4.1.1. For a positive integer n, the partition number of n, denoted by p(n), is the number of different
ways to represent n as a decreasing sum of positive integers.

The different partitions of 6 are:
6 = 6
6 = 5 + 1
6 = 4 + 2        6 = 4 + 1 + 1
6 = 3 + 3        6 = 3 + 2 + 1        6 = 3 + 1 + 1 + 1
6 = 2 + 2 + 2    6 = 2 + 2 + 1 + 1    6 = 2 + 1 + 1 + 1 + 1
6 = 1 + 1 + 1 + 1 + 1 + 1
It is natural to ask how to compute p(n). The "trick" is to think about a recursive solution, and observe that once we decide what the leading number d is, we can solve the problem recursively on the remaining budget n − d, under the constraint that no number exceeds d.

Suggestion 4.1.2. Recursive algorithms are one of the main tools in developing algorithms (and writing programs). If you do not feel comfortable with recursive algorithms, you should spend time playing with recursive algorithms until you feel comfortable using them. Without the ability to think recursively, this class would be a long and painful torture for you. Speak with me if you need guidance on this topic.

The resulting algorithm, Partitions, is depicted below.

PartitionsI(num, d)    // d: max digit
    if (num ≤ 1) or (d = 1)
        return 1
    if d > num
        d ← num
    res ← 0
    for i ← d down to 1
        res ← res + PartitionsI(num − i, i)
    return res

Partitions(n)
    return PartitionsI(n, n)

We are interested in analyzing its running time. To this end, draw the recursion tree of Partitions and observe that the amount of work spent at each node is proportional to the number of children it has. Thus, the overall time spent by the algorithm is proportional to the size of the recursion tree, which is proportional (since every node is either a leaf or has at least two children) to the number of leaves in the tree, which is Θ(p(n)).
This is not very exciting, since it is easy to verify that 3^(√n/4) ≤ p(n) ≤ n^n.
Exercise 4.1.3. Prove the above bounds on p(n) (or better bounds).

Suggestion 4.1.4. Exercises in the class notes are natural easy questions for inclusion in exams. You probably want to spend time doing them.

Hardy and Ramanujan (in 1918) showed that p(n) ≈ e^{π√(2n/3)} / (4n√3) (which I am sure was your first guess).
It is natural to ask, if there is a faster algorithm. Or more specifically, why is the algorithm Partitions
so slowwwwwwwwwwwwwwwwww? The answer is that during the computation of Partitions(n) the function
PartitionsI(num, max_digit) is called a lot of times with the same parameters.
An easy way to overcome this problem is to cache the results of PartitionsI using a hash table.¬ Whenever PartitionsI is called, it checks in a cache table whether it has already computed the value of the function for these parameters, and if so, it returns the cached result. Otherwise, it computes the value, and before returning the value, stores it in the cache. This simple (but powerful) idea is known as memoization.

PartitionsI_C(num, max_digit)
    if (num ≤ 1) or (max_digit = 1)
        return 1
    if max_digit > num
        max_digit ← num
    if ⟨num, max_digit⟩ in cache
        return cache(⟨num, max_digit⟩)
    res ← 0
    for i ← max_digit down to 1
        res ← res + PartitionsI_C(num − i, i)
    cache(⟨num, max_digit⟩) ← res
    return res

PartitionS_C(n)
    return PartitionsI_C(n, n)

What is the running time of PartitionS_C? Recursive algorithms that have been transformed by memoization are usually analyzed as follows: (i) bound the number of values stored in the hash table, and (ii) bound the amount of work involved in storing one value into the hash table (ignoring recursive calls).
Here is the argument in this case:
(A) If a call to PartitionsI_C takes (by itself) more than constant time, then this call performs a store in the cache.
(B) The number of store operations in the cache is O(n²), since this is the number of different entries stored in the cache. Indeed, for PartitionsI_C(num, max_digit), the parameters num and max_digit are both integers in the range 1, . . . , n.
(C) We charge the work in the loop to the resulting store. The work in the loop is at most O(n) time (since max_digit ≤ n).
(D) As such, the overall running time of PartitionS_C(n) is O(n²) × O(n) = O(n³).
Note that this analysis is naive, but it is sufficient for our purposes (verify that the bound of O(n³) on the running time is tight in this case).
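In Python, the memoized version is a direct transcription of the pseudocode, with a dictionary standing in for the hash table. It confirms p(6) = 11, matching the enumeration of the partitions of 6 above:

```python
cache = {}

def partitions_c(num, max_digit):
    """Number of partitions of num into parts of size at most max_digit."""
    if num <= 1 or max_digit == 1:
        return 1
    if max_digit > num:
        max_digit = num
    if (num, max_digit) in cache:
        return cache[(num, max_digit)]
    # the O(n)-time loop, charged to the store below
    res = sum(partitions_c(num - i, i) for i in range(max_digit, 0, -1))
    cache[(num, max_digit)] = res
    return res

def partitions(n):
    return partitions_c(n, n)

print(partitions(6))  # 11
```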
¬ Throughout the course, we will assume that a hash table operation can be done in constant time. This is a reasonable
assumption using randomization and perfect hashing.

4.1.1. A Short sermon on memoization
This idea of memoization is generic and nevertheless very useful. To recap, it works by taking a recursive function and caching the results as the computation goes on. Before trying to compute a value, check if it was already computed and stored in the cache. If so, return the result from the cache. If it is not in the cache, compute it and store it in the cache (for the time being, you can think about the cache as being a hash table).
• When does it work: There is a lot of inefficiency in the computation of the recursive function because
the same call is being performed repeatedly.
• When it does NOT work:
(A) The number of different recursive function calls (i.e., the different values of the parameters in the
recursive call) is “large”.
(B) When the function has side effects.

Tidbit 4.1.5. Some functional programming languages allow one to take a recursive function f(·) that you already implemented, and give you a memoized version f′(·) of this function, without the programmer doing any extra work. For a nice description of how to implement it in Scheme, see [ASS96].

It is natural to ask if we can do better than just using caching. As usual in life - more pain, more gain. Indeed, in a lot of cases we can analyze the recursive calls, and store the results directly in a (sometimes multi-dimensional) array. This gets rid of the recursion (which used to be important long ago, when memory used by the stack was a truly limited resource, but is less important nowadays), which usually yields a slight improvement in performance in the real world.
This technique is known as dynamic programming­. We can sometimes save space and improve running time in dynamic programming over memoization.

Dynamic programming made easy:


(A) Solve the problem using recursion - easy (?).
(B) Modify the recursive program so that it caches the results.
(C) Dynamic programming: Modify the cache into an array.

4.2. Example – Fibonacci numbers


Let us revisit the classical problem of computing Fibonacci numbers.

4.2.1. Why, where, and when?


To remind the reader, in the Fibonacci sequence, the first two numbers F0 = 0 and F1 = 1, and Fi = Fi−1 + Fi−2 ,
for i > 1. This sequence was discovered independently in several places and times. From Wikipedia:

“The Fibonacci sequence appears in Indian mathematics, in connection with Sanskrit prosody.
In the Sanskrit oral tradition, there was much emphasis on how long (L) syllables mix with the
short (S), and counting the different patterns of L and S within a given fixed length results in the
Fibonacci numbers; the number of patterns that are m short syllables long is the Fibonacci number
Fm+1 .”

(To see that, imagine that a long syllable is equivalent in length to two short syllables.) Surprisingly, the credit
for this formalization goes back more than 2000 years (!)
­ As usual in life, it is not dynamic, it is not programming, and it is hardly a technique. To overcome this, most texts find creative ways to present this topic in the most opaque way possible.

FibDP(n)
    if n ≤ 1
        return 1
    if F[n] initialized
        return F[n]
    F[n] ← FibDP(n − 1) + FibDP(n − 2)
    return F[n]

Figure 4.1

Fibonacci was a decent mathematician (1170-1250 AD), and his most significant and lasting contribution was spreading the Hindu-Arabic numerical system (i.e., zero) in Europe. He was the son of a rich merchant, and spent much of his time growing up in Algiers, where he learned the decimal notation system. He traveled throughout the Mediterranean world to study mathematics. When he came back to Italy, he published a sequence of books (the first one, "Liber Abaci", contained the description of the decimal notation system). In this book, he also posed the following problem:
Consider a rabbit population, assuming that: A newly born pair of rabbits, one male, one female,
are put in a field; rabbits are able to mate at the age of one month so that at the end of its second
month a female can produce another pair of rabbits; rabbits never die and a mating pair always
produces one new pair (one male, one female) every month from the second month on. The puzzle
that Fibonacci posed was: how many pairs will there be in one year?
(The above is largely based on Wikipedia.)

4.2.2. Computing Fibonacci numbers

The recursive function for computing Fibonacci numbers, FibR, is depicted below.

FibR(n)
    if n = 0
        return 1
    if n = 1
        return 1
    return FibR(n − 1) + FibR(n − 2)

As before, the running time of FibR(n) is proportional to O(F_n), where F_n is the nth Fibonacci number. It is known that

    F_n = (1/√5) [ ((1 + √5)/2)^n − ((1 − √5)/2)^n ] = Θ(φ^n),

where φ = (1 + √5)/2.
We can now use memoization, and with a bit of care, it is easy enough to come up with the dynamic
programming version of this procedure, see FibDP in Figure 4.1. Clearly, the running time of FibDP(n) is
linear (i.e., O(n)).
A careful inspection of FibDP exposes the fact that it fills the array F[...] from left to right. In particular,
it only requires the last two numbers in the array.
As such, we can get rid of the array altogether, and reduce the space needed to O(1):

FibI(n)
    prev ← 0, curr ← 1
    for i = 1 to n do
        next ← curr + prev
        prev ← curr
        curr ← next
    return curr

This is a phenomenon that is quite common in dynamic programming: by carefully inspecting the way the array/table is being filled, one can sometimes save space by being careful about the implementation.
The running time of FibI is identical to the running time of FibDP. Can we do better? Surprisingly, the answer is yes. To this end, observe that

    ( y, x + y )ᵀ = [ 0 1 ; 1 1 ] ( x, y )ᵀ,

where [ 0 1 ; 1 1 ] denotes the 2 × 2 matrix with rows (0, 1) and (1, 1).

Figure 4.2: Interpreting edit-distance as an alignment task: the string “har-peled” is aligned against “sharp eyed” by inserting ‘s’, deleting ‘-’, replacing ‘l’ by ‘y’, and inserting a space. Aligning identical characters to each other is free of cost. The price in this example is 4. There are other ways to get the same edit-distance in this case.

As such, writing M = [ 0 1 ; 1 1 ] for the above matrix, we have

    ( F_{n−1}, F_n )ᵀ = M ( F_{n−2}, F_{n−1} )ᵀ = M² ( F_{n−3}, F_{n−2} )ᵀ = ⋯ = M^{n−3} ( F_2, F_3 )ᵀ.

Thus, computing the nth Fibonacci number can be done by computing the matrix power M^{n−3}.
How to do this quickly? Well, we know that a∗b∗c = (a∗b)∗c = a∗(b∗c)®, and as such one can compute aⁿ by repeated squaring; see the pseudo-code below:

FastExp(a, n)
    if n = 0 then return 1
    if n = 1 then return a
    if n is even then
        return (FastExp(a, n/2))²
    else
        return a ∗ (FastExp(a, (n − 1)/2))²

The running time of FastExp is O(log n), as can be easily verified. Thus, we can compute F_n in O(log n) time.
But something is very strange. Observe that F_n has Θ(n) digits (about n log₁₀ φ ≈ 0.209 n of them). How can we compute a number that is that large in logarithmic time? Well, we assumed that the time to handle a number is O(1), independent of its size. This is not true in practice if the numbers are large. Naturally, one has to be very careful with such assumptions.

4.3. Edit Distance


We are given two strings A and B, and we want to know how close the two strings are to each other. Namely, how many edit operations must one make to turn the string A into B?
We allow the following operations: (i) insert a character, (ii) delete a character, and (iii) replace a character by a different character. The price of each operation is one unit.
For example, consider the strings A = “har-peled” and B = “sharp eyed”. Their edit distance is 4, as can be easily seen.
® Associativity of multiplication...

But how do we compute the edit-distance (the minimum number of edit operations needed)? The idea is to list the edit operations from left to right; then edit distance turns into an alignment problem, see Figure 4.2. In particular, the idea of the recursive algorithm is to inspect the last character, and decide which of the categories it falls into: insert, delete, or replace/ignore. See the pseudo-code below, where [A[m] ≠ B[n]] equals 1 if the two characters differ, and 0 otherwise:

ed(A[1..m], B[1..n])
    if m = 0 return n
    if n = 0 return m
    p_insert ← ed(A[1..m], B[1..(n − 1)]) + 1
    p_delete ← ed(A[1..(m − 1)], B[1..n]) + 1
    p_replace/ignore ← ed(A[1..(m − 1)], B[1..(n − 1)]) + [A[m] ≠ B[n]]
    return min(p_insert, p_delete, p_replace/ignore)
The running time of ed(...)? Clearly exponential – roughly 2^{n+m}, where n + m is the size of the input. But how many different recursive calls does ed perform? Only O(nm) different calls, since the only parameters that matter are n and m.
So the natural thing is to introduce memoization. The resulting algorithm edM is depicted below:

edM(A[1..m], B[1..n])
    if m = 0 return n
    if n = 0 return m
    if T[m, n] is initialized then return T[m, n]
    p_insert ← edM(A[1..m], B[1..(n − 1)]) + 1
    p_delete ← edM(A[1..(m − 1)], B[1..n]) + 1
    p_replace/ignore ← edM(A[1..(m − 1)], B[1..(n − 1)]) + [A[m] ≠ B[n]]
    T[m, n] ← min(p_insert, p_delete, p_replace/ignore)
    return T[m, n]

The running time of edM(n, m), when executed on two strings of length n and m respectively, is O(nm), since there are O(nm) store operations into the cache, and each store requires O(1) time (by charging one for each recursive call). Looking at the entry T[i, j] in the table, we realize that it depends only on T[i − 1, j], T[i, j − 1] and T[i − 1, j − 1]. Thus, instead of a recursive algorithm, we can fill the table T row by row, from left to right.
The dynamic programming version that uses a two-dimensional array is now pretty simple to derive, and is depicted below. Clearly, it requires O(nm) time and O(nm) space.

edDP(A[1..m], B[1..n])
    for i = 1 to m do T[i, 0] ← i
    for j = 1 to n do T[0, j] ← j
    for i ← 1 to m do
        for j ← 1 to n do
            p_insert ← T[i, j − 1] + 1
            p_delete ← T[i − 1, j] + 1
            p_replace/ignore ← T[i − 1, j − 1] + [A[i] ≠ B[j]]
            T[i, j] ← min(p_insert, p_delete, p_replace/ignore)
    return T[m, n]

It is enlightening to think about the algorithm as computing, for each T[i, j], the cell it got its value from. What you get is a tree encoded in the table; see Figure 4.3. It is now easy to extract from the table the sequence of edit operations that realizes the minimum edit distance between A and B. Indeed, we start a walk on this graph from the node corresponding to T[n, m]. Every time we walk left, it corresponds to a deletion; every time we go up, it corresponds to an insertion; and going diagonally corresponds to either a replace or an ignore.
Note that when computing the ith row of T, we only need the value of the cell to the left of the current cell, and two cells in the row above the current cell. It is thus easy to verify that the algorithm needs to remember only the current and previous rows to compute the edit distance. We conclude:

Theorem 4.3.1. Given two strings A and B of length n and m, respectively, one can compute their edit distance in O(nm) time. This uses O(nm) space if we want to extract the sequence of edit operations, and O(n + m) space if we only want to output the price of the edit distance.

Exercise 4.3.2. Show how to compute the sequence of edit-distance operations realizing the edit distance using
only O(n + m) space and O(nm) running time. (Hint: Use a recursive algorithm, and argue that the recursive
call is always on a matrix which is of size, roughly, half of the input matrix.)

Figure 4.3: Extracting the edit operations from the table: the edit-distance table for the strings “ALGORITHM” (columns) and “ALTRUISTIC” (rows), where each entry also records an arrow to the cell it got its value from. Following the arrows back from the bottom-right entry (the edit distance, 6, in this example) recovers the sequence of edit operations.

4.3.1. Shortest path in a DAG and dynamic programming


Given a dynamic programming problem and its associated recursive program, one can consider all the different possible recursive calls as configurations. We can create a graph, where every configuration is a node, and an edge is introduced between two configurations if one configuration is computed from the other; we put the additional price that might be involved in moving between the two configurations on the edge connecting them. As such, for the edit distance, we have directed edges from the vertex (i, j) to (i, j − 1) and (i − 1, j), both with weight 1 on them. Also, we have an edge between (i, j) and (i − 1, j − 1), which is of weight 0 if A[i] = B[j] and 1 otherwise. Clearly, in the resulting graph, we are asking for the shortest path between (n, m) and (0, 0).
And here is where things get interesting. The resulting graph G is a DAG (directed acyclic graph¯). A DAG can be interpreted as a partial ordering of the vertices, and by topological sort on the graph (which takes linear time), one can get a full ordering of the vertices which agrees with the DAG. Using this ordering, one can compute the shortest path in a DAG in linear time (in the size of the DAG). For edit-distance the DAG size is O(nm), and as such this algorithm takes O(nm) time.
This interpretation of dynamic programming as a shortest path problem in a DAG is a useful way of thinking about it, and works for many dynamic programming problems.
More surprisingly, one can also compute the longest path in a DAG in linear time, even with negatively weighted edges. Some dynamic programming problems are, in fact, equivalent to computing such a longest path.
¯ No cycles in the graph – it’s a miracle!

Chapter 5

Dynamic programming II – The Recursion Strikes Back

“No, mademoiselle, I don’t capture elephants. I content myself with living among them. I like them. I like looking at
them, listening to them, watching them on the horizon. To tell you the truth, I’d give anything to become an elephant
myself. That’ll convince you that I’ve nothing against the Germans in particular: they’re just men to me, and that’s
enough.”
– The Roots of Heaven, Romain Gary.

5.1. Optimal search trees


Given a binary search tree T, the time to search for an element x, that is stored in T, is O(1 + depth(T, x)),
where depth(T, x) denotes the depth of x in T (i.e., this is the length of the path connecting x with the root of
T).

Problem 5.1.1. Given a set of n (sorted) keys A[1 . . . n], build the best binary search tree for the elements of A.

Note that we store the values in the internal nodes of the binary trees.

[Figure: Two possible search trees for the set A = [4, 12, 21, 32, 45] – one with 12 at the root, and one with 32 at the root.]

Clearly, if we are accessing the number 12 all the time, the tree with 12 at the root would be better to use than the other one.
Usually, we just build a balanced binary tree, and this is good enough. But assume that we have additional information about the frequency with which we access the element A[i], for i = 1, . . . , n. Namely, we know that A[i] is going to be accessed f[i] times, for i = 1, . . . , n.
In this case, we know that the total search time for a tree T is

    S(T) = Σ_{i=1}^{n} (depth(T, i) + 1) · f[i],

where depth(T, i) is the depth of the node in T storing the value A[i]. Assume that A[r] is the value stored in the root of the tree T. Clearly, all the values smaller than A[r] are in the subtree left_T, and all values larger than A[r] are in right_T.
Thus, the total search time, in this case, is

    S(T) = Σ_{i=1}^{r−1} (depth(left_T, i) + 1) f[i] + Σ_{i=1}^{n} f[i] + Σ_{i=r+1}^{n} (depth(right_T, i) + 1) f[i],

where the middle term is the price of accessing the root. Observe that if T is the optimal search tree for the access frequencies f[1], . . . , f[n], then the subtree left_T must be optimal for the elements accessing it (i.e., A[1 . . . r − 1], where r is the root).
Thus, the price of T is

    S(T) = S(left_T) + S(right_T) + Σ_{i=1}^{n} f[i],

where S(Q) is the price of searching in Q for the frequency of elements stored in Q.

Figure 5.1: A polygon and two possible triangulations of the polygon.

This recursive formula naturally gives rise to a recursive algorithm, which is depicted below.

CompBestTree( A[1 . . . n], f[1 . . . n] )
    return CompBestTreeI( A[1 . . . n], f[1 . . . n] )

CompBestTreeI( A[i . . . j], f[i . . . j] )
    for r = i . . . j do
        T_left ← CompBestTreeI( A[i . . . r − 1], f[i . . . r − 1] )
        T_right ← CompBestTreeI( A[r + 1 . . . j], f[r + 1 . . . j] )
        T_r ← Tree( T_left, A[r], T_right )
        P_r ← S(T_r)
    return the cheapest tree out of T_i, . . . , T_j

A naive implementation requires O(n²) time per call (ignoring the recursive calls), since computing S(T_r) from scratch takes linear time for each candidate root. In fact, by a more careful implementation, together with the tree T we can also return the price of searching in this tree with the given frequencies; with this modification, the function takes only O(n) time per call (again, ignoring the recursive calls). The running time of the resulting algorithm is

    α(n) = O(n) + Σ_{i=0}^{n−1} ( α(i) + α(n − i − 1) ),

and the solution of this recurrence is O(n 3ⁿ).


We can, of course, improve the running time using memoization. There are only O(n²) different recursive calls, and as such, the running time of the memoized version is O(n²) · O(n) = O(n³).

Theorem 5.1.2. One can compute the optimal binary search tree in O(n³) time using O(n²) space.

A further improvement arises from the fact that the root location is “monotone”. Formally, if R[i, j] denotes the location of the element stored in the root for the elements A[i . . . j], then it holds that R[i, j − 1] ≤ R[i, j] ≤ R[i, j + 1]. This limits the search space, and we can be more efficient in the search. This leads to an O(n²) algorithm. Details are in Jeff Erickson’s class notes.
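A memoized Python sketch of the O(n³) computation (cost only, no tree extraction; the function names here are made up for illustration). Prefix sums make the Σ f[i] term of the recurrence an O(1) lookup:

```python
from functools import lru_cache

def opt_bst_cost(f):
    """Price S(T) of the optimal binary search tree for the access
    frequencies f[0..n-1], using S(T) = S(left) + S(right) + sum of f."""
    n = len(f)
    prefix = [0] * (n + 1)               # prefix[j] = f[0] + ... + f[j-1]
    for i in range(n):
        prefix[i + 1] = prefix[i] + f[i]

    @lru_cache(maxsize=None)
    def S(i, j):
        # Optimal cost for the keys i, i+1, ..., j-1.
        if i >= j:
            return 0
        total = prefix[j] - prefix[i]    # every key pays for the root level
        return total + min(S(i, r) + S(r + 1, j) for r in range(i, j))

    return S(0, n)
```

There are O(n²) states and O(n) work per state, matching the theorem's bound.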

5.2. Optimal Triangulations


Given a convex polygon P in the plane, we would like to find the triangulation of P of minimum total length.
Namely, the total length of the diagonals of the triangulation of P, plus the (length of the) perimeter of P are
minimized. See Figure 5.1.
Definition 5.2.1. A set S ⊆ ℝᵈ is convex if for any two x, y ∈ S, the segment xy is contained in S.
A convex polygon is a closed cycle of segments, with no vertex pointing inward. Formally, it is a simple
closed polygonal curve which encloses a convex set.
A diagonal is a line segment connecting two vertices of a polygon which are not adjacent. A triangulation
is a partition of a convex polygon into (interior) disjoint triangles using diagonals.

Observation 5.2.2. Any triangulation of a convex polygon with n vertices is made out of exactly n−2 triangles.

Our purpose is to find the triangulation of P that has the minimum total length. Namely, the total length of the diagonals used in the triangulation is minimized. We would like to compute the optimal triangulation using divide and conquer. As the figure on the right demonstrates, there is always a triangle in the triangulation that breaks the polygon into two polygons. Thus, we can try and guess such a triangle of the optimal triangulation, and recurse on the two polygons thus created. The only difficulty is to do this in such a way that the recursive subproblems can be described in a succinct way.
To this end, we assume that the polygon is specified as a list of vertices 1 . . . n in a clockwise ordering; namely, the input is a list of the vertices of the polygon, where for every vertex the two coordinates are specified. The key observation is that in any triangulation of P, there exists a triangle that uses the edge between vertex 1 and vertex n (the red edge in the figure). In particular, removing this triangle leaves us with two polygons whose vertices are consecutive along the original polygon.
Let M[i, j] denote the price of triangulating a polygon starting at vertex i and ending at vertex j, where every diagonal used contributes its length twice to this quantity, and the perimeter edges contribute their length exactly once. We have the following “natural” recurrence:
recurrence:


M[i, j] =
    0                                                  if j ≤ i,
    0                                                  if j = i + 1,
    min_{i<k<j} ( ∆(i, j, k) + M[i, k] + M[k, j] )     otherwise,

where Dist(i, j) = sqrt( (x[i] − x[j])² + (y[i] − y[j])² ), ∆(i, j, k) = Dist(i, j) + Dist(j, k) + Dist(i, k), and the ith point has coordinates (x[i], y[i]), for i = 1, . . . , n. Note that the quantity we are interested in is M[1, n], since it is the price of the triangulation of P with minimum total weight.
Using dynamic programming (or just memoization), we get an algorithm that computes the optimal triangulation in O(n³) time using O(n²) space.
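The recurrence transcribes into a short memoized Python sketch (illustrative only); it assumes the vertices are given in convex position, in order around the polygon, and indexes them from 0:

```python
from functools import lru_cache
from math import dist  # Euclidean distance (Python 3.8+)

def min_triangulation(pts):
    """Weight of the cheapest triangulation of a convex polygon with
    vertices pts[0..n-1] given in order around the polygon. As in the
    notes, each diagonal is counted twice and each perimeter edge once."""
    n = len(pts)

    @lru_cache(maxsize=None)
    def M(i, j):
        # Cheapest triangulation of the sub-polygon i, i+1, ..., j.
        if j <= i + 1:
            return 0.0
        # The edge ij belongs to some triangle (i, k, j); try every k.
        return min(dist(pts[i], pts[k]) + dist(pts[k], pts[j])
                   + dist(pts[i], pts[j]) + M(i, k) + M(k, j)
                   for k in range(i + 1, j))

    return M(0, n - 1)
```

For the unit square, any triangulation uses one diagonal, so the answer is the perimeter plus twice √2.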

5.3. Matrix Multiplication


We are given two matrices: (i) A, of dimensions p × q (i.e., p rows and q columns), and (ii) B, of size q × r. The product matrix AB, of dimensions p × r, can be computed in O(pqr) time using the standard algorithm.
Things become considerably more interesting when we have to multiply a chain of matrices. Consider, for example, three matrices A, B and C of dimensions 1000 × 2, 2 × 1000 and 1000 × 2, respectively. Computing the matrix ABC = A(BC) requires 2 · 1000 · 2 + 1000 · 2 · 2 = 8,000 operations. On the other hand, computing the same matrix as (AB)C requires 1000 · 2 · 1000 + 1000 · 1000 · 2 = 4,000,000 operations. Note that matrix multiplication is associative, and as such (AB)C = A(BC). Thus, given a chain of matrices that we need to multiply, the order in which we perform the multiplications matters greatly for efficiency.

Problem 5.3.1. The input is n matrices M_1, . . . , M_n such that M_i is of size D[i − 1] × D[i] (i.e., M_i has D[i − 1] rows and D[i] columns), where D[0 . . . n] is an array specifying the sizes. Find the ordering of the multiplications that computes M_1 · M_2 ⋯ M_{n−1} · M_n most efficiently.

Again, let us define a recurrence for this problem, where M[i, j] is the amount of work involved in computing the product of the matrices M_i ⋯ M_j. We have

M[i, j] =
    0                                                               if j = i,
    D[i − 1] · D[i] · D[i + 1]                                      if j = i + 1,
    min_{i≤k<j} ( M[i, k] + M[k + 1, j] + D[i − 1] · D[k] · D[j] )  if j > i + 1.

Again, using memoization (or dynamic programming), one can compute M[1, n] in O(n³) time, using O(n²) space.

5.4. Longest Ascending Subsequence


Given an array of numbers A[1 . . . n], we are interested in finding the longest ascending subsequence. For example, if A = [6, 3, 2, 5, 1, 12], the longest ascending subsequence is 2, 5, 12. To this end, let M[i] denote the length of the longest ascending subsequence having A[i] as its last element. The recurrence on the maximum possible length is

    M[n] = 1                                            if n = 1,
    M[n] = 1 + max_{1 ≤ k < n, A[k] < A[n]} M[k]        otherwise.

The length of the longest ascending subsequence is max_{i=1}^{n} M[i]. Again, using dynamic programming, we get an algorithm with running time O(n²) for this problem. It is also not hard to modify the algorithm so that it outputs this sequence (you should figure out the details of this modification). A better O(n log n) solution is possible using some data-structure magic.
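The O(n²) dynamic program above can be sketched in Python as follows (an illustration, not part of the notes):

```python
def longest_ascending_subsequence(A):
    """Length of the longest ascending subsequence of A, via the O(n^2)
    recurrence: M[i] = length of the longest such subsequence ending at A[i]."""
    n = len(A)
    M = [1] * n
    for i in range(n):
        for k in range(i):
            if A[k] < A[i]:              # A[i] can extend a subsequence ending at A[k]
                M[i] = max(M[i], M[k] + 1)
    return max(M) if M else 0
```

Recording, for each i, the index k realizing the maximum lets one also extract the subsequence itself.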

5.5. Pattern Matching


Tidbit 5.5.1. Magna Carta or Magna Charta – the great charter that King John of England was forced by the English barons to grant at Runnymede, June 15, 1215, traditionally interpreted as guaranteeing certain civil and political liberties.

Assume you have a string S = “Magna Carta” and a pattern P = “?ag∗at∗a∗”, where “?” can match any single character, and “∗” can match any substring. You would like to decide if the pattern matches the string.
We are interested in solving this problem using dynamic programming. This is not too hard, since it is similar to the edit-distance problem that was already covered.

IsMatch(S[1 . . . n], P[1 . . . m])
    if m = 0 and n = 0 then return TRUE
    if m = 0 then return FALSE
    if n = 0 then
        if P[1 . . . m] is all stars then return TRUE
        else return FALSE
    if P[m] = ’?’ then
        return IsMatch(S[1 . . . n − 1], P[1 . . . m − 1])
    if P[m] ≠ ’*’ then
        if P[m] ≠ S[n] then return FALSE
        else return IsMatch(S[1 . . . n − 1], P[1 . . . m − 1])
    // P[m] = ’*’: try every split point for what the star matches.
    for i = 0 to n do
        if IsMatch(S[1 . . . i], P[1 . . . m − 1]) then return TRUE
    return FALSE

The resulting code is depicted above, and as you can see this is pretty tedious. Now, use memoization together with this recursive code, and you get an algorithm with running time O(mn²) and space O(nm), where n is the length of the input string and m is the length of the pattern.
Being slightly more clever, one can get a faster algorithm with running time O(nm). BTW, one can do even better – O(m + n) time is possible – but it requires the Knuth-Morris-Pratt algorithm, which is a fast string-matching algorithm.
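A memoized Python sketch of the matcher (illustrative; match(n, m) answers whether P[1..m] matches S[1..n], exactly as in the recursion above):

```python
from functools import lru_cache

def is_match(S, P):
    """Does pattern P match the whole string S, where '?' matches any
    single character and '*' matches any (possibly empty) substring?"""
    @lru_cache(maxsize=None)
    def match(n, m):
        # Does P[:m] match S[:n]?
        if m == 0:
            return n == 0
        if n == 0:
            return all(c == '*' for c in P[:m])
        if P[m - 1] == '?':
            return match(n - 1, m - 1)
        if P[m - 1] != '*':
            return P[m - 1] == S[n - 1] and match(n - 1, m - 1)
        # P[m-1] is '*': it swallows S[i:n] for some i; try every split.
        return any(match(i, m - 1) for i in range(n + 1))

    return match(len(S), len(P))
```

The star case does O(n) work per state, which is where the O(mn²) bound comes from.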

Figure 5.2: A drawing of the Mona Lisa by solving a TSP instance. The figure on the right is the TSP in the
eyes region.

Figure 5.3: A certain country and its optimal TSP tour.

5.6. Slightly faster TSP algorithm via dynamic programming

TSP: Traveling Salesperson Problem


Instance: A graph G = (V, E) with non-negative edge costs/lengths. Cost c(e) for each edge e ∈ E.
Question: Find a tour of minimum cost that visits each node.

No polynomial time algorithm is known for TSP – the problem is NP-Hard.


Even an exponential time algorithm requires some work. Indeed, there are n! potential TSP tours. Clearly, n! ≤ nⁿ = exp(n ln n) and n! ≥ (n/2)^{n/2} = exp((n/2) ln(n/2)). Using Stirling’s formula, we have n! ≈ √(2πn) (n/e)ⁿ, which gives us the tighter estimate n! = 2^{Θ(n log n)}.
So, naively, any algorithm that enumerates all tours would have running time (at least) Ω(n!). Can we do better? Can we get a 2^{O(n)} running time algorithm?

Towards a Recursive Solution.


(A) Order the vertices of V in some arbitrary order: v1, v2, . . . , vn .
(B) opt(S): optimum TSP tour for the vertices S ⊆ V in the graph restricted to S. We would like to compute
opt(V).

Can we compute opt(S) recursively?
(A) Say v ∈ S. What are the two neighbors of v in optimum tour in S?
(B) If u, w are neighbors of v in an optimum tour of S then removing v gives an optimum path from u to w
visiting all nodes in S − {v}.
Path from u to w is not a recursive subproblem! Need to find a more general problem to allow recursion.

We start with a more general problem: TSP Path.

TSP Path
Instance: A graph G = (V, E) with non-negative edge costs/lengths(c(e) for edge e) and two nodes
s, t.
Question: Find a path from s to t of minimum cost that visits each node exactly once.

We can solve the regular TSP problem using this problem.


We define a recursive problem for the optimum TSP Path problem, as follows:

opt(u, v, S) : optimum TSP Path from u to v in the graph restricted to S s.t. u, v ∈ S.

(A) What is the next node in the optimum path from u to v?


(B) Suppose it is w. Then what is opt(u, v, S)?
(C) opt(u, v, S) = c(u, w) + opt(w, v, S − {u})
(D) We do not know w! So try all possibilities for w.

A Recursive Solution.
(A) opt(u, v, S) = minw ∈S,w,u,v (c(u, w) + opt(w, v, S − {u}))
(B) What are the subproblems for the original problem opt(s, t, V)? For every subset S ⊆ V, we have the
subproblem opt(u, v, S) for u, v ∈ S.
As usual, we need to bound the number subproblems in the recursion:
(A) number of distinct subsets S of V is at most 2n
(B) number of pairs of nodes in a set S is at most n2
(C) hence number of subproblems is O(n2 2n )

Exercise 5.6.1. Show that one can compute TSP using above dynamic program in O(n3 2n ) time and O(n2 2n )
space.

Lemma 5.6.2. Given a graph G with n vertices, one can solve TSP in O(n3 2n ) time.

The disadvantage of the dynamic programming solution is that it uses a lot of memory.
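The subset dynamic program is usually implemented with bitmasks. Here is an illustrative Python sketch of the classical tour version (the Held-Karp algorithm, a close relative of the path recursion above); it fixes vertex 0 as the tour's start, which loses no generality, and takes a full distance matrix as input:

```python
from functools import lru_cache

def tsp(dist):
    """Optimal TSP tour cost for the complete graph with distance matrix
    dist, via the subset dynamic program: O(n^2 2^n) subproblems."""
    n = len(dist)

    @lru_cache(maxsize=None)
    def best(S, v):
        # Cheapest path that starts at vertex 0, visits exactly the
        # vertices in the bitmask S (which contains 0 and v), and ends at v.
        if S == (1 << v) | 1:
            return dist[0][v]
        return min(best(S & ~(1 << v), u) + dist[u][v]
                   for u in range(1, n)
                   if u != v and S & (1 << u))

    full = (1 << n) - 1
    # Close the tour: end at some v, then return to vertex 0.
    return min(best(full, v) + dist[v][0] for v in range(1, n))
```

The exponential memory usage mentioned above is visible here as the cache over all (S, v) pairs.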

Part III
Approximation algorithms

Chapter 6

Approximation algorithms

6.1. Greedy algorithms and approximation algorithms


A natural tendency in solving algorithmic problems is to locally do what seems to be the right thing. This is usually referred to as a greedy algorithm. The problem is that usually these kinds of algorithms do not really work. For example, consider the following optimization version of Vertex Cover:

VertexCoverMin
Instance: A graph G.
Question: Return the smallest subset S ⊆ V(G), s.t. S touches all the edges of G.

For this problem, the greedy algorithm will always take the vertex with the highest degree (i.e., the one covering the largest number of edges), add it to the cover set, remove it from the graph, and repeat. We will refer to this algorithm as GreedyVertexCover.
It is not too hard to see that this algorithm does not output the optimal vertex cover. Indeed, consider the graph depicted in Figure 6.1: the optimal solution is the black vertices, but the greedy algorithm would pick the four yellow vertices.

Figure 6.1: Example.
This of course still leaves open the possibility that, while we do not get the optimal
vertex cover, what we get is a vertex cover which is “relatively good” (or “good enough”).
Definition 6.1.1. A minimization problem is an optimization problem, where we look for a valid solution that
minimizes a certain target function.

Example 6.1.2. In the VertexCoverMin problem the (minimization) target function is the size of the cover. For-
mally Opt(G) = minS ⊆V (G),S cover of G |S|.
The VertexCover(G) is just the set S realizing this minimum.

Definition 6.1.3. Let Opt(G) denote the value of the target function for the optimal solution.

Intuitively, a vertex-cover of size “close” to the optimal solution would be considered to be good.
Definition 6.1.4. An algorithm Alg for a minimization problem Min achieves an approximation factor α ≥ 1 if, for all inputs G, we have

    Alg(G) / Opt(G) ≤ α.

We will refer to Alg as an α-approximation algorithm for Min.

Figure 6.2: Lower bound for greedy vertex cover: successive stages of the execution of GreedyVertexCover on the bipartite graph G_n, with the sides L and R drawn left and right.

As a concrete example, an algorithm is a 2-approximation for VertexCoverMin if it outputs a vertex cover which is at most twice the size of the optimal solution for vertex cover.
So, how good (or bad) is the GreedyVertexCover algorithm described above? Well, the graph in Figure 6.1 shows that the approximation factor of GreedyVertexCover is at least 4/3.
It turns out that the performance of GreedyVertexCover is considerably worse. To this end, consider the following bipartite graph: G_n = (L ∪ R, E), where L is a set of n vertices. Next, for i = 2, . . . , n, we add a set R_i of ⌊n/i⌋ vertices to R, each one of them of degree i, such that all of them (i.e., all vertices of degree i in R) are connected to distinct vertices in L. The execution of GreedyVertexCover on such a graph is shown in Figure 6.2.
Clearly, in G_n all the vertices in L have degree at most n − 1, since each is connected to (at most) one vertex of R_i, for i = 2, . . . , n. On the other hand, there is a vertex of degree n in R (i.e., the single vertex of R_n). Thus, GreedyVertexCover will first remove this vertex. We claim that GreedyVertexCover will remove all the vertices of R_2, . . . , R_n and put them into the vertex cover. To see that, observe that if R_2, . . . , R_i are still active, then all the nodes of R_i have degree i, all the vertices of L have degree at most i − 1, and all the vertices of R_2, . . . , R_{i−1} have degree strictly smaller than i. As such, the greedy algorithm will use the vertices of R_i. Easy induction now implies that all the vertices of R are going to be picked by GreedyVertexCover. This implies the following lemma.
Lemma 6.1.5. The algorithm GreedyVertexCover is an Ω(log n) approximation to the optimal solution of VertexCoverMin.

Proof: Consider the graph G_n above. The optimal solution is to pick all the vertices of L into the vertex cover, which results in a cover of size n. On the other hand, the greedy algorithm picks the set R. We have that

    |R| = Σ_{i=2}^{n} |R_i| = Σ_{i=2}^{n} ⌊n/i⌋ ≥ Σ_{i=2}^{n} ( n/i − 1 ) ≥ ( Σ_{i=1}^{n} n/i ) − 2n = n (H_n − 2),

where H_n = Σ_{i=1}^{n} 1/i = ln n + Θ(1) is the nth harmonic number. As such, the approximation ratio of GreedyVertexCover is at least

    |R| / |L| ≥ n(H_n − 2) / n = Ω(log n).

Theorem 6.1.6. The greedy algorithm for VertexCover achieves a Θ(log n) approximation, where n is the number of vertices in the graph. Its running time is O(mn²).

Proof: The lower bound follows from Lemma 6.1.5. The upper bound follows from the analysis of the greedy algorithm for Set Cover, which will be done shortly.
As for the running time, each iteration of the algorithm takes O(mn) time, and there are at most n iterations.

6.1.1. Alternative algorithm – two for the price of one


One can still do much better than the greedy algorithm in this case. In particular, let ApproxVertexCover be the algorithm that chooses an edge from G, adds both its endpoints to the vertex cover, and removes the two vertices (and all the edges adjacent to these two vertices) from G. This process is repeated till G has no edges. Clearly, the resulting set of vertices is a vertex cover, since the algorithm removes an edge only if it is being covered by the generated cover.

Theorem 6.1.7. ApproxVertexCover is a 2-approximation algorithm for VertexCoverMin that runs in O(n2 )
time.

Proof: The edges picked by the algorithm are vertex-disjoint, and each of them must be covered by the optimal solution; that is, each picked edge contains at least one vertex of the optimal cover, and no such vertex is counted twice. As such, the number of edges picked is at most the size of the optimal cover, and the generated cover (two vertices per picked edge) is at most twice the size of the optimal one.
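ApproxVertexCover is a few lines of Python (an illustrative sketch, with edges given as vertex pairs):

```python
def approx_vertex_cover(edges):
    """2-approximation for minimum vertex cover (ApproxVertexCover):
    repeatedly pick a still-uncovered edge and take both its endpoints."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:   # edge uv is still uncovered
            cover.add(u)
            cover.add(v)
    return cover
```

On a star graph the sketch picks one edge and outputs two vertices, while the optimum is one – exactly the factor-2 worst case.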

6.2. Fixed parameter tractability, approximation, and fast exponential time algorithms (to say nothing of the dog)
6.2.1. A silly brute force algorithm for vertex cover
So given a graph G = (V, E) with n vertices, we can approximate VertexCoverMin up to a factor of two in polynomial time. Let K be the size of this approximate cover – we know that any vertex cover in G must be of size at least K/2, and we have a cover of size K. Imagine the case that K is truly small – can we compute the optimal vertex cover quickly in this case? Well, of course, we could just try all possible subsets of vertices of size at most K, and check for each one whether it is a cover or not. Checking if a specific set of vertices is a cover takes O(m) = O(n²) time, where m = |E|. So, the running time of this algorithm is

    Σ_{i=1}^{K} O( (n choose i) · n² ) ≤ Σ_{i=1}^{K} O( nⁱ · n² ) = O( n^{K+2} ),

where (n choose i) is the number of subsets of the vertices of G of size exactly i. Observe that we do not need to know K – the algorithm can just try all sizes of subsets, till it finds a solution. We thus get the following (not very interesting) result.

Lemma 6.2.1. Given a graph G = (V, E) with n vertices, one can solve VertexCoverMin in O(n^{α+2}) time, where α is the size of the minimum vertex cover.

6.2.2. A fixed parameter tractable algorithm


As before, our input is a graph G = (V, E), for which we want to compute a vertex-cover of minimum size. We
need the following definition:

Definition 6.2.2. Let G = (V, E) be a graph. For a subset S ⊆ V, let G_S be the induced subgraph over S. Namely, it is the graph with the set of vertices being S, where for any pair of vertices x, y ∈ V, the edge xy ∈ E(G_S) if and only if xy ∈ E(G) and x, y ∈ S.

fpVertexCoverInner(X, β)
    // Computes a minimum vertex cover for the induced graph G_X.
    // β: size of the cover computed so far.
    if X = ∅ or G_X has no edges then return β
    e ← any edge uv of G_X
    // Take both u and v into the cover:
    β₁ ← fpVertexCoverInner( X \ {u, v}, β + 2 )
    // Only take u into the cover; but then we must also take all the
    // vertices that are neighbors of v, to cover their edges with v:
    β₂ ← fpVertexCoverInner( X \ ({u} ∪ N_{G_X}(v)), β + |N_{G_X}(v)| )
    // Only take v into the cover:
    β₃ ← fpVertexCoverInner( X \ ({v} ∪ N_{G_X}(u)), β + |N_{G_X}(u)| )
    return min(β₁, β₂, β₃)

algFPVertexCover(G = (V, E))
    return fpVertexCoverInner(V, 0)

Figure 6.3: Fixed parameter tractable algorithm for VertexCoverMin.

Also, in the following, for a vertex v, let NG (v) denote the set of vertices of G that are adjacent to v.
Consider an edge e = uv in G. We know that either u or v (or both) must be in any vertex cover of G, so
consider the brute force algorithm for VertexCoverMin that tries all these possibilities. The resulting algorithm
algFPVertexCover is depicted in Figure 6.3.

Lemma 6.2.3. The algorithm algFPVertexCover (depicted in Figure 6.3) returns the optimal solution to the
given instance of VertexCoverMin.

Proof: It is easy to verify, that if the algorithm returns β then it found a vertex cover of size β. Since the depth
of the recursion is at most n, it follows that this algorithm always terminates.
Consider the optimal solution Y ⊆ V, and run the algorithm, where every stage of the recursion always picks the option that complies with the optimal solution. Clearly, since in every level of the recursion at least one vertex of Y is being found, after O(|Y|) recursive calls the remaining graph has no edges, and the algorithm returns |Y| as one of the candidate solutions. Furthermore, since the algorithm always returns the minimum solution encountered, it follows that it returns the optimal solution.

Lemma 6.2.4. The depth of the recursion of algFPVertexCover(G) is at most α, where α is the minimum size
vertex cover in G.

Proof: The idea is to consider all the vertices that can be added to the vertex cover being computed without
covering any new edge. In particular, in the case the algorithm takes both u and v to the cover, then one of
these vertices must be in the optimal solution, and this can happen at most α times.
The more interesting case is when the algorithm picks N_GX(v) (i.e., β2) for the vertex cover. We can add
v to the vertex cover in this case without covering any new edge (again, we are doing this only
conceptually – the vertex cover computed by the algorithm would not contain v [only its neighbors]). We do
the same thing for the case of β3.
Now, observe that in any of these cases, the hypothetical vertex cover being constructed (which has more vertices
than what the algorithm computes, but covers exactly the same set of edges in the original graph) gains one
vertex of the optimal solution in each level of the recursion. Clearly, the algorithm is done
once we pick all the vertices of the optimal solution into the hypothetical vertex cover. It follows that the depth
of the recursion is ≤ α.

Theorem 6.2.5. Let G be a graph with n vertices, and with the minimal vertex cover being of size α. Then,
the algorithm algFPVertexCover (depicted in Figure 6.3) returns the optimal vertex cover for G, and the running
time of this algorithm is O(3^α n²).

Proof: By Lemma 6.2.4, the recursion tree has depth α. As such, it contains at most 2 · 3^α nodes. Each node
in the recursion requires O(n²) work (ignoring the recursive calls), if implemented naively. Thus, the bound on
the running time follows.

Algorithms where the running time is of the form O(n^c f(α)), where α is some parameter that depends on
the problem, are fixed parameter tractable algorithms for the given problem.
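As a concrete illustration of Figure 6.3's branching, here is a minimal Python sketch (the adjacency-dictionary encoding and all names are mine, not from the notes); it returns only the size of the minimum vertex cover:

```python
def fpt_vertex_cover(adj):
    """Size of a minimum vertex cover, via the 3-way branching of Figure 6.3.

    adj maps each vertex to the set of its neighbors."""
    # Find any remaining edge uv.
    edge = next(((u, v) for u, nbrs in adj.items() for v in nbrs), None)
    if edge is None:          # no edges left: nothing more to cover
        return 0
    u, v = edge

    def remove(vertices):
        """The graph induced on the remaining vertices."""
        vs = set(vertices)
        return {w: adj[w] - vs for w in adj if w not in vs}

    n_u, n_v = adj[u], adj[v]
    b1 = 2 + fpt_vertex_cover(remove({u, v}))            # take both u and v
    b2 = len(n_v) + fpt_vertex_cover(remove({v} | n_v))  # take N(v) instead of v
    b3 = len(n_u) + fpt_vertex_cover(remove({u} | n_u))  # take N(u) instead of u
    return min(b1, b2, b3)
```

Note that in the second branch the cost is |N(v)| (which already accounts for u, as u ∈ N(v)), matching the pseudocode.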

6.2.2.1. Remarks
Currently, the fastest algorithm known for this problem has running time O(1.2738^α + αn) [CKX10]. This
algorithm uses similar ideas, but is considerably more complicated.
It is known that no better approximation than 1.3606 is possible for VertexCoverMin, unless P = NP. The
currently best approximation known is 2 − Θ(1/√(log n)). If the Unique Games Conjecture is true, then no
better constant approximation is possible in polynomial time.

6.3. Approximating maximum matching


Definition 6.3.1. Consider an undirected graph G = (V, E). The graph might have a weight function ω(e),
specifying a positive value on the edges of G (if no weights are specified, treat every edge as having weight 1).
• A subset M ⊆ E is a matching if no pair of edges of M share endpoints.
• A perfect matching is a matching that covers all the vertices of G.
• A min-weight perfect matching is the minimum weight matching among all perfect matchings, where
the weight of a matching is ω(M) = ∑_{e∈M} ω(e).
• The maximum-weight matching (or just maximum matching) is the matching with maximum weight
among all matchings.
• A matching M is maximal if no edge can be added to it. That is, for every edge e ∈ E, at least one
endpoint of e is covered by an edge of M.

Note the subtle difference between maximal and maximum – the first, is a local maximum, while the other one
is the global maximum.

Lemma 6.3.2. Given an undirected unweighted graph G with n vertices and m edges, one can compute a
matching M in G, such that |M | ≥ |opt| /2, where opt is the maximum size (i.e., cardinality) matching in G.
The running time is O(n + m).

Proof: The algorithm is shockingly simple – repeatedly pick an edge of G, remove it and the edges adjacent to
it, and repeat till there are no edges left in the graph. Let M be the resulting matching.
To see why this is a two approximation (i.e., 2|M| ≥ |opt|), observe that every edge of M is adjacent to at
most two edges of opt. As such, each edge of M pays for two edges of opt, which implies the claim.
One way to see that is to imagine that we start with the matching opt and let M = {m1, . . . , mt } – at each
iteration, we insert mi into the current matching, and remove any old edges that intersect it. As such, we moved
from the matching opt to the matching M. In each step, we deleted at most two edges, and inserted one
edge. As such, |opt| ≤ 2|M|.
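The algorithm of Lemma 6.3.2 is short enough to state in runnable form. A Python sketch (the edge-list representation and names are mine):

```python
def greedy_matching(edges):
    """Maximal matching in O(n + m): scan the edges once, keeping any
    edge whose endpoints are both still unmatched."""
    matched, matching = set(), []
    for u, v in edges:
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching
```

By the lemma, the returned matching has at least half as many edges as a maximum matching.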

Lemma 6.3.3. Given an undirected weighted graph G with n vertices and m edges, one can compute a matching
M in G, such that ω(M) ≥ ω(opt)/2, where opt is the maximum weight matching in G. The running time is
O(n log n + m).

Proof: We run the algorithm for the unweighted case, with the modification that we always pick the heaviest
edge still available. The same argument as in Lemma 6.3.2 implies that this is a two approximation. As
for the running time – we need a max-heap for m elements, that performs at most n deletions, and as such, the
running time is O(n log n + m) by using a Fibonacci heap.
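A sketch of the weighted variant; for simplicity it sorts the edges, which costs O(m log m) rather than the heap-based bound in the lemma, but the approximation argument is unchanged (edge format and names are mine):

```python
def greedy_weighted_matching(weighted_edges):
    """Scan edges from heaviest to lightest, keeping any edge whose
    endpoints are both still unmatched; a 2-approximation for the
    maximum-weight matching, as in Lemma 6.3.3."""
    matched, matching = set(), []
    for w, u, v in sorted(weighted_edges, reverse=True):
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching
```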

Remark 6.3.4. Note that maximum matching (and all the variants mentioned above) are solvable in polynomial
time. The point of the above algorithm is that it is both simple and gives us a decent starting point, which can
be used in the exact algorithm.

6.4. Graph diameter


FILL IN.

6.5. Traveling Salesman Person


We remind the reader that the optimization variant of the TSP problem is the following.

TSP-Min
Instance: G = (V, E) a complete graph, and ω(e) a cost function on edges of G.
Question: The cheapest tour that visits all the vertices of G exactly once.

Theorem 6.5.1. TSP-Min can not be approximated within any factor unless NP = P.

Proof: Consider the reduction from Hamiltonian Cycle into TSP. Given a graph G, which is the input for the
Hamiltonian cycle problem, we transform it into an instance of TSP-Min. Specifically, we set the weight of every edge to
1 if it was present in the instance of the Hamiltonian cycle, and 2 otherwise. In the resulting complete graph,
if there is a tour of price n then there is a Hamiltonian cycle in the original graph. If, on the other hand, there
is no such cycle in G, then the cheapest TSP tour has price at least n + 1.
Instead of 2, let us assign the missing edges a weight of cn, for c an arbitrary number. Let H denote
the resulting graph. Clearly, if G does not contain any Hamiltonian cycle, then the price
of the TSP-Min in H is at least cn + 1.
Note that the prices of tours of H are either (i) equal to n, if there is a Hamiltonian cycle in G, or (ii) larger
than cn + 1, if there is no Hamiltonian cycle in G. As such, if one can do a c-approximation, in polynomial time,
to TSP-Min, then using it on H would yield a tour of price ≤ cn if a tour of price n exists. But a tour of price
≤ cn exists if and only if G has a Hamiltonian cycle.
Namely, such an approximation algorithm would solve an NP-Complete problem (i.e., Hamiltonian Cycle)
in polynomial time.

Note that Theorem 6.5.1 implies that TSP-Min can not be approximated to within any factor. However, once
we add some assumptions to the problem, it becomes much more manageable (at least as far as approximation is concerned).
What the above reduction did was to take a problem and reduce it into an instance where there is a huge gap
between the optimal solution and the second cheapest solution. Next, we argued that if we had an approximation
algorithm whose ratio is better than the ratio between the two endpoints of this empty interval, then this
approximation algorithm would, in polynomial time, be able to decide if there is an optimal solution.

6.5.1. TSP with the triangle inequality
6.5.1.1. A 2-approximation
Consider the following special case of TSP:

TSP△-Min
Instance: G = (V, E) is a complete graph. There is also a cost function ω(·) defined over the edges of
G, that complies with the triangle inequality.
Question: The cheapest tour that visits all the vertices of G exactly once.

We remind the reader that the triangle inequality holds for ω(·) if

∀u, v, w ∈ V(G), ω(u, v) ≤ ω(u, w) + ω(w, v).

The triangle inequality implies that if we have a path σ in G, that starts at s and ends at t, then ω(st) ≤ ω(σ).
Namely, shortcutting, that is going directly from s to t, is always beneficial if the triangle inequality holds
(assuming that we do not have any reason to visit the other vertices of σ).
Definition 6.5.2. A cycle in a graph G is Eulerian if it visits every edge of G exactly once.

Unlike Hamiltonian cycle, which has to visit every vertex exactly once, an Eulerian cycle might visit a vertex
an arbitrary number of times. We need the following classical result:
Lemma 6.5.3. A graph G has a cycle that visits every edge of G exactly once (i.e., an Eulerian cycle) if and
only if G is connected, and all the vertices have even degree. Such a cycle can be computed in O(n + m) time,
where n and m are the number of vertices and edges of G, respectively.

Our purpose is to come up with a 2-approximation algorithm for TSP△-Min. To this end, let C_opt denote
the optimal TSP tour in G. Observe that C_opt is a spanning graph of G, and as such we have that

ω(C_opt) ≥ weight of the cheapest spanning graph of G.

But the cheapest spanning graph of G is the minimum spanning tree (MST) of G, and as such ω(C_opt) ≥
ω(MST(G)). The MST can be computed in O(n log n + m) = O(n²) time, where n is the number of vertices of G,
and m = n(n − 1)/2 is the number of edges (since G is the complete graph). Let T denote the MST of G, and convert
T into a tour by duplicating every edge (so every edge appears twice). Let H denote the new graph. We have that H is a connected
graph, every vertex of H has even degree, and as such H has an Eulerian tour (i.e., a tour that visits every edge
of H exactly once).
As such, let C denote the Eulerian cycle in H. Observe that

ω(C) = ω(H) = 2ω(T) = 2ω(MST(G)) ≤ 2ω(C_opt).
Next, we traverse C starting from any vertex v ∈ V(C). As we traverse C, we skip vertices that we already
visited, and in particular, the new tour we extract from C will visit the vertices of V(G) in the order they first
appear in C. Let π denote the new tour of G. Clearly, since we are performing shortcutting, and the triangle
inequality holds, we have that ω(π) ≤ ω(C). The resulting algorithm is depicted in Figure 6.4.
It is easy to verify, that all the steps of our algorithm can be done in polynomial time. As such, we have
the following result.
Theorem 6.5.4. Given an instance of TSP with the triangle inequality (TSP△-Min) (namely, a graph G with
n vertices and n(n − 1)/2 edges, and a cost function ω(·) on the edges that complies with the triangle inequality), one
can compute a tour of G of length ≤ 2ω(C_opt), where C_opt is the minimum cost TSP tour of G. The running
time of the algorithm is O(n²).

Figure 6.4: The TSP approximation algorithm: (a) the input, (b) the duplicated graph, (c) the extracted
Eulerian tour, and (d) the resulting shortcut path.
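A compact way to realize steps (b)–(d) of Figure 6.4 in code is to observe that shortcutting the Eulerian tour of the doubled MST visits the vertices in a DFS preorder of T. The following Python sketch (Prim's algorithm on a distance matrix; all names are mine) returns such a tour:

```python
import heapq

def tsp_2_approx(dist):
    """Metric-TSP 2-approximation: build an MST with Prim's algorithm,
    then output its preorder traversal, which equals the shortcut
    Eulerian tour of the doubled tree."""
    n = len(dist)
    parent = [0] * n
    best = [float('inf')] * n
    best[0] = 0
    in_tree = [False] * n
    children = [[] for _ in range(n)]
    heap = [(0, 0)]
    while heap:                      # lazy Prim's MST, rooted at vertex 0
        _, u = heapq.heappop(heap)
        if in_tree[u]:
            continue
        in_tree[u] = True
        if u != 0:
            children[parent[u]].append(u)
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
                heapq.heappush(heap, (dist[u][v], v))
    tour, stack = [], [0]            # preorder = shortcut Euler tour
    while stack:
        u = stack.pop()
        tour.append(u)
        stack.extend(reversed(children[u]))
    return tour
```

Under the triangle inequality the resulting cycle has weight at most twice ω(MST(G)) ≤ 2ω(C_opt).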

6.5.1.2. A 3/2-approximation to TSP△-Min


The following is a known result, and we will see a somewhat weaker version of it in class.

Theorem 6.5.5. Given a graph G and weights on the edges, one can compute the min-weight perfect matching
of G in polynomial time.

Lemma 6.5.6. Let G = (V, E) be a complete graph, S a subset of the vertices of V of even size, and ω(·) a
weight function over the edges. Then, the weight of the min-weight perfect matching in G_S (the subgraph induced on S) is ≤ ω(TSP(G))/2.

Proof: Let π be the cycle realizing the TSP in G. Let σ be the cycle
resulting from shortcutting π so that it uses only the vertices of S. Clearly,
ω(σ) ≤ ω(π). Now, let Me and Mo be the sets of even and odd edges of σ,
respectively. Clearly, both Mo and Me are perfect matchings in G_S, and

ω(Mo) + ω(Me) = ω(σ).

We conclude that min(ω(Mo), ω(Me)) ≤ ω(TSP(G))/2.

We now have a creature that has the weight of half of the TSP, and
we can compute it in polynomial time. How can we use it to approximate the
TSP? The idea is that we can make the MST of G into an Eulerian graph
by being more careful. To this end, consider the tree on the right (with vertices
numbered 1 through 7). Clearly, it is almost Eulerian, except for these pesky
odd degree vertices. Indeed, if all the vertices of the spanning tree had even
degree, then the graph would be Eulerian (see Lemma 6.5.3).
In particular, in the depicted tree, the “problematic” vertices are 1, 4, 2, 7, since they are all the odd degree
vertices in the MST T.

Lemma 6.5.7. The number of odd degree vertices in any graph G′ is even.

Proof: Observe that μ = ∑_{v∈V(G′)} d(v) = 2|E(G′)|, where d(v) denotes the degree of v. Let
U = ∑_{v∈V(G′), d(v) is even} d(v), and observe that U is even, as it is the sum of even numbers.
Thus, ignoring vertices of even degree, we have

α = ∑_{v∈V(G′), d(v) is odd} d(v) = μ − U = even number,

since μ and U are both even. Thus, the number of summands in the above sum of odd numbers must be even,
since the total sum is even.
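Lemma 6.5.7 is easy to sanity-check mechanically; a small Python sketch (names are mine) verifies the parity claim on random graphs:

```python
import random

def odd_degree_count(n, edges):
    """Number of odd-degree vertices of a graph on vertices 0..n-1."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return sum(1 for d in deg if d % 2 == 1)

# Check the lemma on a few random graphs: the count is always even.
rng = random.Random(0)
for _ in range(100):
    n = rng.randint(2, 12)
    pairs = [(u, v) for u in range(n) for v in range(u + 1, n)]
    edges = rng.sample(pairs, rng.randint(0, len(pairs)))
    assert odd_degree_count(n, edges) % 2 == 0
```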

So, we have an even number of problematic vertices in T. The idea
now is to compute a minimum-weight perfect matching M on the problematic
vertices, and add the edges of the matching to the tree. The
resulting graph, for our running example, is depicted on the right. Let
H = (V, E(M) ∪ E(T)) denote this graph, which is the result of adding M
to T.
We observe that H is Eulerian, as all the vertices now have even degree, and the graph is connected. We
also have

ω(H) = ω(MST(G)) + ω(M) ≤ ω(TSP(G)) + ω(TSP(G))/2 = (3/2)ω(TSP(G)),

by Lemma 6.5.6. Now, H is Eulerian, and one can compute the Euler cycle for H, shortcut it, and get a tour
of the vertices of G of weight ≤ (3/2)ω(TSP(G)).

Theorem 6.5.8. Given an instance of TSP with the triangle inequality, one can compute in polynomial time,
a (3/2)-approximation to the optimal TSP.

6.6. Biographical Notes


The 3/2-approximation for TSP with the triangle inequality is due to Christofides [Chr76].

Chapter 7

Approximation algorithms II

7.1. Max Exact 3SAT


We remind the reader that an instance of 3SAT is a boolean formula, for example F = (x1 + x2 + x3 )(x4 + x1 + x2 ),
and the decision problem is to decide if the formula has a satisfiable assignment. Interestingly, we can turn this
into an optimization problem.

Max 3SAT
Instance: A collection of clauses: C1, . . . , Cm .
Question: Find the assignment to x1, ..., xn that satisfies the maximum number of clauses.

Clearly, since 3SAT is NP-Complete it implies that Max 3SAT is NP-Hard. In particular, the formula F
becomes the following set of two clauses:

x1 + x2 + x3 and x4 + x1 + x2 .

Note, that Max 3SAT is a maximization problem.


Definition 7.1.1. Algorithm Alg for a maximization problem achieves an approximation factor α if for all inputs
G, we have Alg(G)/Opt(G) ≥ α.

In the following, we present a randomized algorithm – it is allowed to consult a source of random
numbers in making decisions. A key property we need about random variables is the linearity of expectation
property, which is easy to derive directly from the definition of expectation.
Definition 7.1.2 (Linearity of expectation.). Given two random variables X, Y (not necessarily independent), we
have that E[X + Y] = E[X] + E[Y].
Theorem 7.1.3. One can achieve (in expectation) (7/8)-approximation to Max 3SAT in polynomial time.
Namely, if the instance has m clauses, then the generated assignment satisfies (7/8)m clauses in expectation.

Proof: Let x1, . . . , xn be the n variables used in the given instance. The algorithm works by randomly assigning
values to x1, . . . , xn, independently and with equal probability, to 0 or 1, for each one of the variables.
Let Yi be the indicator variable which is 1 if (and only if) the ith clause is satisfied by the generated random
assignment and 0 otherwise, for i = 1, . . . , m. Formally, we have

Yi = 1 if Ci is satisfied by the generated assignment, and Yi = 0 otherwise.

Now, the number of clauses satisfied by the given assignment is Y = ∑_{i=1}^{m} Yi. We claim that E[Y] = (7/8)m,
where m is the number of clauses in the input. Indeed, we have

E[Y] = E[∑_{i=1}^{m} Yi] = ∑_{i=1}^{m} E[Yi],

by linearity of expectation. Now, what is the probability that Yi = 0? This is the probability that all three
literals appearing in the clause Ci evaluate to FALSE. Since the three literals are instances of three distinct
variables, these three events are independent, and as such the probability of this happening is

P[Yi = 0] = (1/2) · (1/2) · (1/2) = 1/8.

(Another way to see this is to observe that since Ci has exactly three literals, there is only one possible
assignment to the three variables appearing in it such that the clause evaluates to FALSE. Now, there are eight
(8) possible assignments to this clause, and thus the probability of picking a FALSE assignment is 1/8.) Thus,

P[Yi = 1] = 1 − P[Yi = 0] = 7/8,

and

E[Yi] = P[Yi = 0] · 0 + P[Yi = 1] · 1 = 7/8.

Namely, E[# of clauses satisfied] = E[Y] = ∑_{i=1}^{m} E[Yi] = (7/8)m. Since the optimal solution satisfies at most m clauses,
the claim follows.
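The algorithm of Theorem 7.1.3 is essentially a one-liner; the following Python sketch (the clause encoding and all names are mine: a clause is a triple of signed variable indices) estimates the expectation empirically:

```python
import random

def random_sat_count(clauses, num_vars, rng):
    """One run of the trivial algorithm: flip a fair coin for every variable.

    A clause is a triple of nonzero ints: +i stands for x_i, -i for its
    negation. Returns the number of satisfied clauses."""
    assign = {i: rng.random() < 0.5 for i in range(1, num_vars + 1)}
    return sum(
        1 for clause in clauses
        if any(assign[abs(lit)] == (lit > 0) for lit in clause))

# Averaging many runs approaches the expectation (7/8) * m = 3.5 here,
# since every clause below uses three distinct variables.
rng = random.Random(42)
clauses = [(1, 2, 3), (-1, 2, -3), (1, -2, 3), (-1, -2, -3)]
avg = sum(random_sat_count(clauses, 3, rng) for _ in range(20000)) / 20000
```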

Curiously, Theorem 7.1.3 is stronger than what one usually would be able to get for an approximation
algorithm. Here, the approximation quality is independent of how well the optimal solution does (the optimal
can satisfy at most m clauses, and as such we get a (7/8)-approximation). Curiouser and curiouser¬, the algorithm
does not even look at the input when generating the random assignment.
Håstad [Hås01a] proved that one can do no better; that is, for any constant ε > 0, one can not approximate
3SAT in polynomial time (unless P = NP) to within a factor of 7/8 + ε. It is pretty amazing that a trivial
algorithm like the above is essentially optimal.

7.2. Approximation Algorithms for Set Cover


7.2.1. Guarding an Art Gallery

You are given the floor plan of an art gallery, which is a two dimensional
simple polygon. You would like to place guards that see the whole polygon. A
guard is a point, which can see all points around it, but it can not see through
walls. Formally, a point p can see a point q, if the segment pq is contained
inside the polygon. See the figure on the right for an illustration of how the input
looks.
A visibility polygon at p (depicted as the yellow polygon on the left) is
the region inside the polygon that p can see. We would like to find the minimal
number of guards needed to guard the given art gallery. That is, all the points
in the art gallery should be visible from at least one guard we place.
The art-gallery problem is a set-cover problem. We have a ground set (the
polygon), and a family of sets (the set of all visibility polygons), and the target
is to find a minimal number of sets covering the whole polygon.
It is known that finding the minimum number of guards needed is NP-Hard. No approximation is currently
known. It is also known that a polygon with n corners can be guarded using ⌊n/3⌋ + 1 guards. Note that
this problem is harder than the classical set-cover problem because the number of subsets is infinite and the
underlying base set is also infinite.
An interesting open problem is to find a polynomial time approximation algorithm, such that given P, it
computes a set of guards, such that #guards ≤ n k_opt, where n is the number of vertices of the input polygon
P, and k_opt is the number of guards used by the optimal solution.

7.2.2. Set Cover


The optimization version of Set Cover, is the following:

Set Cover
Instance: (S, F):
S - a set of n elements
F - a family of subsets of S, s.t. ∪_{X∈F} X = S.
Question: The set X ⊆ F such that X contains as few sets as possible, and X covers S. Formally,
∪_{X∈X} X = S.
The set S is sometime called the ground set, and a pair (S, F) is either called a set system or a hypergraph.
Note, that Set Cover is a minimization problem which is also NP-Hard.
¬ “Curiouser and curiouser!” Cried Alice (she was so much surprised, that for the moment she quite forgot how to speak good
English). – Alice in wonderland, Lewis Carol

GreedySetCover(S, F)
X ← ∅; T ← S
while T is not empty do
U ← set in F covering largest
# of elements in T
X ← X ∪ {U}
T ←T \U

return X.

Figure 7.1

Example 7.2.1. Consider the set S = {1, 2, 3, 4, 5} and the following family of subsets

F = {{1, 2, 3}, {2, 5}, {1, 4}, {4, 5}} .

Clearly, the smallest cover of S is Xopt = {{1, 2, 3}, {4, 5}}.
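The greedy algorithm of Figure 7.1 in runnable form (a Python sketch; names are mine). On Example 7.2.1 it happens to find the optimal cover, though in general its output can be Θ(log n) times larger, as Section 7.2.3 shows:

```python
def greedy_set_cover(ground, family):
    """GreedySetCover: repeatedly pick the set covering the most
    still-uncovered elements (ties broken by order in the family)."""
    uncovered = set(ground)
    cover = []
    while uncovered:
        best = max(family, key=lambda s: len(s & uncovered))
        if not best & uncovered:
            raise ValueError("family does not cover the ground set")
        cover.append(best)
        uncovered -= best
    return cover
```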

The greedy algorithm GreedySetCover for this problem is depicted in Figure 7.1. Here, the algorithm always
picks the set in the family that covers the largest number of elements not covered yet. Clearly, the algorithm
is polynomial in the input size. Indeed, we are given a set S of n elements, and m subsets. As such, the input
size is at least Ω(m + n) (and at most of size O(mn)), and the algorithm takes time polynomial in m and n. Let
Xopt = {V1, . . . ,Vk } be the optimal solution.
Let Ti denote the elements not covered at the beginning of the ith iteration of GreedySetCover, where T1 = S. Let
Ui be the set added to the cover in the ith iteration, and αi = |Ui ∩ Ti| be the number of new elements being
covered in the ith iteration.

Claim 7.2.2. We have α1 ≥ α2 ≥ . . . ≥ αk ≥ . . . ≥ αm .

Proof: Assume for contradiction that αi < αi+1. Since Ti+1 ⊆ Ti, the set Ui+1 covers at least αi+1 elements
of Ti. Namely, in the ith iteration, Ui+1 covers more elements of Ti than the set Ui picked by the algorithm.
This contradicts the greediness of GreedySetCover in choosing the set covering the largest number of elements
not covered yet. A contradiction.

Claim 7.2.3. We have αi ≥ |Ti | /k. Namely, |Ti+1 | ≤ (1 − 1/k) |Ti |.

Proof: Consider the optimal solution. It is made out of k sets and it covers S, and as such it covers Ti ⊆ S.
This implies that one of the subsets in the optimal solution covers at least a 1/k fraction of the elements of Ti.
The greedy algorithm picks the set that covers the largest number of elements of Ti, and thus Ui covers at
least αi ≥ |Ti|/k elements.
As for the second claim, we have that |Ti+1| = |Ti| − αi ≤ (1 − 1/k)|Ti|.

Theorem 7.2.4. The algorithm GreedySetCover generates a cover of S using at most O(k log n) sets of F, where
k is the size of the cover in the optimal solution.

Proof: We have that |Ti| ≤ (1 − 1/k)|Ti−1| ≤ (1 − 1/k)^i |T0| = (1 − 1/k)^i n. In particular, for M = ⌈2k ln n⌉ we have

|TM| ≤ (1 − 1/k)^M n ≤ exp(−M/k) n ≤ exp(−2 ln n) n = n/n² = 1/n < 1,

since 1 − x ≤ e^{−x}, for x ≥ 0. Namely, |TM| = 0. As such, the algorithm terminates before reaching the Mth
iteration, and as such it outputs a cover of size O(k log n), as claimed.

7.2.3. Lower bound
The lower bound example is depicted in the following figure: the ground set is split into two rows, the set Y4
on top and the set Z4 on the bottom, and the columns (of doubling widths) are the sets X1, X2, X3, X4.
We provide a more formal description of this lower bound next, and prove that it shows that GreedySetCover
is no better than an Ω(log n) approximation.
We want to show here that the greedy algorithm analysis is tight. To this end, consider the set system
Λi = (Si, Fi), where Si = Yi ∪ Zi, Yi = {y1, . . . , y_{2^i−1}}, and Zi = {z1, . . . , z_{2^i−1}}. The family of sets Fi contains the
following sets

X_j = {y_{2^{j−1}}, . . . , y_{2^j−1}, z_{2^{j−1}}, . . . , z_{2^j−1}},

for j = 1, . . . , i. Furthermore, Fi also contains the two special sets Yi and Zi. Clearly, the minimum set cover for Λi
consists of the two sets Yi and Zi.
However, the sets Yi and Zi have size 2^i − 1. But the set Xi has size

|Xi| = 2(2^i − 1 − 2^{i−1} + 1) = 2^i,

and this is the largest set in Λi. As such, the greedy algorithm GreedySetCover would pick Xi as the first set in its
cover. However, once you remove Xi from Λi (and from its ground set), you remain with the set system Λi−1.
We conclude that GreedySetCover would pick the sets Xi, Xi−1, . . . , X1 for the cover, while the optimal cover
uses two sets. We conclude:

Lemma 7.2.5. Let n = 2^{i+1} − 2. There exists an instance of Set Cover with n elements, for which the optimal
cover uses two sets, but GreedySetCover would use i = ⌊lg n⌋ sets for the cover. That is, GreedySetCover is a
Θ(log n) approximation to SetCover.
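The construction of Λi, together with a small greedy implementation, can be checked mechanically. A Python sketch (the encodings and all names are mine):

```python
def lower_bound_instance(i):
    """Build the set system Lambda_i = (S_i, F_i).

    Ground set: Y u Z with |Y| = |Z| = 2^i - 1; the family holds Y, Z and
    the sets X_j = {y_t, z_t : 2^(j-1) <= t <= 2^j - 1}, for j = 1..i."""
    n = 2 ** i - 1
    Y = {('y', t) for t in range(1, n + 1)}
    Z = {('z', t) for t in range(1, n + 1)}
    Xs = [{(c, t) for c in 'yz' for t in range(2 ** (j - 1), 2 ** j)}
          for j in range(1, i + 1)]
    return Y | Z, [Y, Z] + Xs

def greedy(ground, family):
    """Plain greedy set cover: largest number of new elements first."""
    uncovered, cover = set(ground), []
    while uncovered:
        best = max(family, key=lambda s: len(s & uncovered))
        cover.append(best)
        uncovered -= best
    return cover
```

For i = 4, greedy picks X4, X3, X2, X1 (four sets), while {Y, Z} is a cover of size two, matching Lemma 7.2.5.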

7.2.4. Just for fun – weighted set cover

Weighted Set Cover

Instance: (S, F, ρ):
S: a set of n elements
F: a family of subsets of S, s.t. ∪_{X∈F} X = S.
ρ(·): A price function assigning a price to each set in F.
Question: The set X ⊆ F, such that X covers S. Formally, ∪_{X∈X} X = S, and ρ(X) = ∑_{X∈X} ρ(X) is
minimized.

The greedy algorithm in this case, WGreedySetCover, repeatedly picks the set that pays the least per
element it covers. Specifically, if a set X ∈ F covers t new elements, then the average price it pays per
element covered is α(X) = ρ(X)/t. WGreedySetCover, as such, picks the set with the lowest average price. Our
purpose here is to prove that this greedy algorithm provides an O(log n) approximation.
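A Python sketch of WGreedySetCover (the representation and names are mine):

```python
def weighted_greedy_set_cover(ground, priced_family):
    """WGreedySetCover: repeatedly take the set of lowest average price
    rho(X) / (# new elements of X), until everything is covered.

    priced_family is a list of (price, set) pairs."""
    uncovered = set(ground)
    cover, total = [], 0.0
    while uncovered:
        price, best = min(
            ((p, s) for p, s in priced_family if s & uncovered),
            key=lambda ps: ps[0] / len(ps[1] & uncovered))
        cover.append(best)
        total += price
        uncovered -= best
    return cover, total
```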

7.2.4.1. Analysis
Let Ui be the set of elements that are not covered yet at the end of the ith iteration. As such, U0 = S. At the
beginning of the ith iteration, the average optimal cost is αi = ρ(opt)/ni, where opt is the optimal solution
and ni = |Ui−1| is the number of uncovered elements.

Lemma 7.2.6. We have that:
(A) α1 ≤ α2 ≤ · · · .
(B) For i < j, we have 2αi ≤ α j only if n j ≤ ni /2.
Proof: (A) is hopefully obvious – as the number of elements not covered decreases, the average price to cover
the remaining elements using the optimal solution goes up.
(B) 2αi ≤ α j implies that 2ρ(opt)/ni ≤ ρ(opt)/n j , which implies in turn that 2n j ≤ ni .
So, let k be the first iteration such that nk ≤ n/2. The basic idea is that the total price that WGreedySetCover
paid during iterations 1 to k − 1 is at most 2ρ(opt). This immediately implies the O(log n) approximation, since
such a halving can happen at most O(log n) times till the ground set is fully covered.
To this end, we need the following technical lemma.
Lemma 7.2.7. Let Ui−1 be the set of elements not yet covered at the beginning of the ith iteration, and let
αi = ρ(opt)/ni be the average optimal cost per element. Then, there exists a set X in the optimal solution with
lower average cost; that is, ρ(X)/|Ui−1 ∩ X| ≤ αi.

Proof: Let X1, . . . , Xm be the sets used in the optimal solution. Let s_j = |Ui−1 ∩ X_j|, for j = 1, . . . , m, be the
number of new elements covered by each one of these sets. Similarly, let ρ_j = ρ(X_j), for j = 1, . . . , m. The
average cost of the jth set is ρ_j/s_j (it is +∞ if s_j = 0). It is easy to verify that

min_{j=1..m} ρ_j/s_j ≤ (∑_{j=1}^{m} ρ_j) / (∑_{j=1}^{m} s_j) = ρ(opt) / (∑_{j=1}^{m} s_j) ≤ ρ(opt)/|Ui−1| = αi.

The first inequality follows as, if a/b ≤ c/d (all positive numbers), then a/b ≤ (a + c)/(b + d) ≤ c/d. In particular, for
any such numbers min(a/b, c/d) ≤ (a + c)/(b + d), and applying this repeatedly implies this inequality. The second
inequality follows as ∑_{j=1}^{m} s_j ≥ |Ui−1|. This implies that the optimal solution must contain a set with
an average cost smaller than the average optimal cost.
Lemma 7.2.8. Let k be the first iteration such that nk ≤ n/2. The total price of the sets picked in iteration 1
to k − 1, is at most 2ρ(opt).
Proof: By Lemma 7.2.7, at each iteration the algorithm picks a set with average cost that is smaller than the
optimal average cost (which goes up in each iteration). However, the optimal average cost during iterations 1 to k − 1
is at most twice the starting cost, since the number of elements not covered is at least half the total number
of elements. It follows that, for each element covered, the greedy algorithm paid at most twice the initial
optimal average cost. So, if the number of elements covered by the beginning of the kth iteration is β ≥ n/2,
then the total price paid is at most 2α1β = 2(ρ(opt)/n)β ≤ 2ρ(opt), implying the claim.
Theorem 7.2.9. WGreedySetCover computes an O(log n) approximation to the optimal weighted set cover solution.
Proof: WGreedySetCover paid at most twice the optimal solution to cover half the elements, by Lemma 7.2.8.
Now, one can repeat the argument on the remaining uncovered elements. Clearly, after O(log n) such halving
steps, all the elements would be covered. In each halving step, WGreedySetCover paid at most twice the optimal
cost.

7.3. Biographical Notes


The Max 3SAT problem remains hard in the “easier” MAX 2SAT variant, where every clause has 2 variables.
It is known to be NP-Hard and approximable within 1.0741 [FG95], and is not approximable within 1.0476
[Hås01a]. Notice that the fact that MAX 2SAT is hard to approximate is surprising, as 2SAT can be solved in
polynomial time (!).

Chapter 8

Approximation algorithms III

8.1. Clustering
Consider the problem of unsupervised learning. We are given a set of examples, and we would like to partition
them into classes of similar examples. For example, given a webpage X about “The reality dysfunction”, one
would like to find all webpages on this topic (or closely related topics). Similarly, a webpage about “All quiet
on the western front” should be in the same group as a webpage about “Storm of steel” (since both are about soldier
experiences in World War I).
The hope is that all such webpages of interest would be in the same cluster as X, if the clustering is good.
More formally, the input is a set of examples, usually interpreted as points in high dimensions. For example,
given a webpage W, we represent it as a point in high dimensions, by setting the ith coordinate to 1 if the word
wi appears somewhere in the document, where we have a prespecified list of 10, 000 words that we care about.
Thus, the webpage W can be interpreted as a point of the {0, 1}10,000 hypercube; namely, a point in 10, 000
dimensions.
Let X be the resulting set of n points in d dimensions.
To be able to partition points into similar clusters, we need to define a notion of similarity. Such a similarity
measure can be any distance function between points. For example, consider the “regular” Euclidean distance
between points, where

‖p − q‖ = √( ∑_{i=1}^{d} (p_i − q_i)² ),

where p = (p1, . . . , pd) and q = (q1, . . . , qd).


As another motivating example, consider the facility location problem. We are given a set X of n cities
and distances between them, and we would like to build k hospitals, so that the maximum distance of a city
from its closest hospital is minimized. (So that the maximum time it would take a patient to get to its
closest hospital is bounded.)
Intuitively, what we are interested in is selecting good representatives for the input point-set X. Namely,
we would like to find k points in X such that they represent X “well”.
Formally, consider a subset S of k points of X, and a point p of X. The distance of p from the set S is

d(p, S) = min_{q∈S} ‖p − q‖;

namely, d(p, S) is the minimum distance of a point of S to p. If we interpret S as a set of centers, then d(p, S) is
the distance of p to its closest center.
Now, the price of clustering X by the set S is

ν(X, S) = max_{p∈X} d(p, S).

This is the maximum distance of a point of X from its closest center in S.
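A direct Python transcription of these two definitions (names are mine):

```python
import math

def dist_to_set(p, S):
    """d(p, S) = min over q in S of ||p - q||."""
    return min(math.dist(p, q) for q in S)

def clustering_price(X, S):
    """nu(X, S) = max over p in X of d(p, S)."""
    return max(dist_to_set(p, S) for p in X)
```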


It is somewhat illuminating to consider the problem in the plane. We have
a set P of n points in the plane, we would like to find k smallest discs centered
at input points, such that they cover all the points of P. Consider the example
on the right.

In this example, assume that we would like to cover it by 3 disks. One
possible solution is shown in Figure 8.1. The quality of the solution
is the radius r of the largest disk. As such, the clustering problem here can
be interpreted as the problem of computing an optimal cover of the input
point set by k discs/balls of minimum radius. This is known as the k-center
problem.
It is known that k-center clustering is NP-Hard, even to approximate
within a factor of (roughly) 1.8. Interestingly, there is a simple and elegant
2-approximation algorithm. Namely, one can compute, in polynomial time,
k centers, such that they induce balls of radius at most twice the optimal
radius.

Figure 8.1: The marked point is the bottleneck point.

Here is the formal definition of the k-center clustering problem.

k-center clustering
Instance: A set P of n points, a distance function d(p, q), for p, q ∈ P, with the triangle inequality
holding for d(·, ·), and a parameter k.
Question: A subset S ⊆ P, |S| = k, that realizes r_opt(P, k) = min_{S⊆P, |S|=k} D_S(P), where
D_S(P) = max_{x∈P} d(x, S) and d(x, S) = min_{s∈S} d(s, x).

8.1.1. The approximation algorithm for k-center clustering

To come up with the idea behind the algorithm, imagine that we already
have a solution with m centers (say, m = 3, as in the example above). We
would like to pick the (m + 1)st center. Inspecting the examples above, one
realizes that the solution is determined by a bottleneck point; see Figure 8.1.
That is, there is a single point which determines the quality of the clustering,
which is the point furthest away from the set of centers. As such, the natural
step is to find a new center that would better serve this bottleneck point.
And, what can be better service for this point, than making it the next
center? (The resulting clustering using the new center for the example is
depicted on the right.)
Namely, we always pick the bottleneck point, which is furthest away from the current set of centers, as the
next center to be added to the solution.

The resulting approximation algorithm is depicted below. Observe that the quantity r_{i+1} denotes the
(minimum) radius of the i balls centered at u_1, . . . , u_i such that they cover P (where all these balls have the
same radius). (Namely, there is a point p ∈ P such that d(p, {u_1, . . . , u_i}) = r_{i+1}.)

    AprxKCenter(P, k)
        P = {p_1, . . . , p_n}
        S = {p_1}, u_1 ← p_1
        d_j ← ∞, for j = 1, . . . , n
        while |S| < k do
            i ← |S|
            for j = 1, . . . , n do
                d_j ← min(d_j, d(p_j, u_i))
            r_{i+1} ← max(d_1, . . . , d_n)
            u_{i+1} ← point of P realizing r_{i+1}
            S ← S ∪ {u_{i+1}}
        return S

It would be convenient, for the sake of analysis, to imagine that we run AprxKCenter one additional
iteration, so that the quantity r_{k+1} is well defined.
Observe that the running time of the algorithm AprxKCenter is O(nk), as can be easily verified.

Lemma 8.1.1. We have that r_2 ≥ . . . ≥ r_k ≥ r_{k+1}.

Proof: At each iteration the algorithm adds one new center, and as such the distance of a point to the closest
center cannot increase. In particular, the distance of the furthest point to the centers does not increase.
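To make the greedy algorithm concrete, here is a minimal Python sketch of it. The representation of points and the `dist` callback are illustrative assumptions of this sketch; the notes only assume a metric.

```python
def aprx_k_center(points, k, dist):
    """Greedy 2-approximation for k-center: repeatedly add the bottleneck point.

    points: list of input points; dist: a metric on them (an assumption of
    this sketch). Returns the list of k chosen centers."""
    centers = [points[0]]                      # u_1: an arbitrary first center
    d = [dist(p, centers[0]) for p in points]  # d[j] = distance of p_j to nearest center
    while len(centers) < k:
        # the bottleneck point: furthest from the current centers
        j = max(range(len(points)), key=lambda m: d[m])
        centers.append(points[j])
        for m, p in enumerate(points):         # update nearest-center distances
            d[m] = min(d[m], dist(p, points[j]))
    return centers
```

Each iteration is a single linear scan, so the total running time is O(nk), matching Theorem 8.1.4.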

Observation 8.1.2. The radius of the clustering generated by AprxKCenter is rk+1 .

Lemma 8.1.3. We have that rk+1 ≤ 2ropt (P, k), where ropt (P, k) is the radius of the optimal solution using k
balls.

Proof: Consider the k balls forming the optimal solution: D_1, . . . , D_k, and consider the k center points contained
in the solution S computed by AprxKCenter.
If every disk D_i contains at least one point of S, then we are done, since every point of P is
in distance at most 2r_opt(P, k) from one of the points of S. Indeed, if the ball D_i, centered at
q, contains the point u ∈ S, then for any point p ∈ P ∩ D_i, we have that

    d(p, u) ≤ d(p, q) + d(q, u) ≤ 2r_opt.

Otherwise, there must be two points x and y of S contained in the same ball D_i of the
optimal solution. Let D_i be centered at a point q.
We claim that the distance between x and y is at least r_{k+1}. Indeed, imagine that x was added
at the αth iteration (that is, u_α = x), and y was added in a later βth iteration (that is,
u_β = y), where α < β. Then,

    r_β = d(y, {u_1, . . . , u_{β−1}}) ≤ d(x, y),

since x = u_α and y = u_β. But r_β ≥ r_{k+1}, by Lemma 8.1.1. Applying the triangle inequality again, we have that
r_{k+1} ≤ r_β ≤ d(x, y) ≤ d(x, q) + d(q, y) ≤ 2r_opt, implying the claim.

Theorem 8.1.4. One can approximate the k-center clustering up to a factor of two, in time O(nk).

Proof: The approximation algorithm is AprxKCenter. The approximation quality guarantee follows from
Lemma 8.1.3, since the furthest point of P from the k computed centers is in distance r_{k+1} from them, which
is guaranteed to be at most 2r_opt.

8.2. Subset Sum

Subset Sum
Instance: X = {x_1, . . . , x_n} – n positive integers, t – a target number
Question: Is there a subset of X such that the sum of its elements is t?

Subset Sum is (of course) NPC, as we already proved. It can be solved in polynomial time if the numbers
of X are small. In particular, if x_i ≤ M, for i = 1, . . . , n, then t ≤ Mn (otherwise, there is no solution). It
is reasonably easy to solve the problem in this case, as the following algorithm shows. The running time of
the resulting algorithm is O(Mn²).

    SolveSubsetSum(X, t, M)
        b[0 . . . Mn] – boolean array initialized to FALSE.
        // b[x] is TRUE if x can be realized by a subset of X.
        b[0] ← TRUE.
        for i = 1, . . . , n do
            for j = Mn down to x_i do
                b[j] ← b[j − x_i] ∨ b[j]
        return b[t]

Note that M might be prohibitively large, and as such, this algorithm is not polynomial in n. In particular,
if M = 2^n then this algorithm is prohibitively slow. Since the relevant decision problem is NPC, it is unlikely
that an efficient algorithm exists for this problem. But still, we would like to be able to solve it quickly and
efficiently. So, if we want an efficient solution, we would have to change the problem slightly. As a first step,
let us turn it into an optimization problem.
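The boolean-table algorithm can be sketched in Python as follows (a minimal sketch; as in the text, it assumes every number in X is at most M):

```python
def solve_subset_sum(X, t, M):
    """Exact subset sum via the boolean table; O(M n^2) time and O(M n) space.
    Assumes every number in X is at most M, so any realizable sum is <= M*n."""
    n = len(X)
    b = [False] * (M * n + 1)          # b[s] is True iff s is a subset sum of X
    b[0] = True
    for x in X:
        for j in range(M * n, x - 1, -1):  # scan downward so each x is used at most once
            b[j] = b[j] or b[j - x]
    return b[t] if 0 <= t <= M * n else False
```

The downward scan over j is what prevents an element from being used twice within one pass.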

Subset Sum Optimization
Instance: (X, t): A set X of n positive integers, and a target number t.
Question: The largest number γopt one can represent as a subset sum of X that is smaller than or
equal to t.

Intuitively, we would like to find a subset of X such that its sum is smaller than t but very close to t.
Next, we turn the problem into an approximation problem.

Subset Sum Approx


Instance: (X, t, ε): A set X of n positive integers, a target number t, and parameter ε > 0.
Question: A number z that one can represent as a subset sum of X, such that (1 − ε)γopt ≤ z ≤
γopt ≤ t.

The challenge is to solve this approximation problem efficiently. To demonstrate that there is hope that this can
be done, consider the following simple approximation algorithm, that achieves a constant factor approximation.

Lemma 8.2.1. Let (X, t) be an instance of Subset Sum, and let γopt be the optimal solution to this instance.
Then one can compute a subset sum that adds up to at least γopt/2 in O(n log n) time.

Proof: Scan the numbers from largest to smallest, adding each one to the output sum; whenever adding a
number would make the sum exceed t, we throw it away. We claim that the generated sum s has the property
that γopt/2 ≤ s ≤ t. Clearly, if the total sum of the numbers is smaller than t, then no number is rejected
and s = γopt.
Otherwise, let u be the first number being rejected, and let s′ be the partial subset sum just before u is
rejected. Clearly, s′ ≥ u > 0, s′ < t, and s′ + u > t. This implies t < s′ + u ≤ s′ + s′ = 2s′, which implies
that s′ ≥ t/2. Namely, the subset sum output is at least t/2 ≥ γopt/2.
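The greedy procedure from the proof can be sketched in a few lines of Python (the function name is illustrative):

```python
def greedy_half_subset_sum(X, t):
    """Scan the numbers largest first, keeping each one that still fits under t.
    Returns a subset sum s with gamma_opt / 2 <= s <= t (per Lemma 8.2.1)."""
    s = 0
    for x in sorted(X, reverse=True):
        if s + x <= t:          # keep x only if it does not overshoot the target
            s += x
    return s
```

The O(n log n) bound comes entirely from the sort; the scan itself is linear.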

8.2.1. On the complexity of ε-approximation algorithms


Definition 8.2.2 (PTAS). For a maximization problem PROB, an algorithm A(I, ε) (i.e., A receives as input an
instance of PROB, and an approximation parameter ε > 0) is a polynomial time approximation scheme
(PTAS) if for any instance I we have

(1 − ε) opt(I) ≤ A(I, ε) ≤ opt(I) ,

where opt(I) denotes the price of the optimal solution for I, and A(I, ε) denotes the price of the solution
output by A. Furthermore, the running time of the algorithm A is polynomial in n (the input size), when ε
is fixed.
For a minimization problem, the condition is that opt(I) ≤ A(I, ε) ≤ (1 + ε) opt(I).

Example 8.2.3. An approximation algorithm with running time O(n^{1/ε}) is a PTAS, while an algorithm with
running time O(1/ε^n) is not.

Definition 8.2.4 (FPTAS.). An approximation algorithm is fully polynomial time approximation scheme
(FPTAS) if it is a PTAS, and its running time is polynomial both in n and 1/ε.

Example 8.2.5. A PTAS with running time O(n^{1/ε}) is not a FPTAS, while a PTAS with running time O(n²/ε³)
is a FPTAS.

8.2.2. Approximating subset-sum

Let S = {x_1, . . . , x_n} be a set of numbers. For a number x, let x + S denote the translation of S by x;
namely, x + S = {x_1 + x, x_2 + x, . . . , x_n + x}. Our first step in deriving an approximation algorithm for
Subset Sum is to come up with a slightly different algorithm for solving the problem exactly. The algorithm
is depicted below.

    ExactSubsetSum(S, t)
        n ← |S|
        P_0 ← {0}
        for i = 1 . . . n do
            P_i ← P_{i−1} ∪ (P_{i−1} + x_i)
            Remove from P_i all elements > t
        return largest element in P_n

Note that while ExactSubsetSum performs only n iterations, the lists P_i that it constructs might have
exponential size.
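In Python, the exact list-based algorithm is only a few lines (a sketch; sets stand in for the sorted lists of the pseudocode):

```python
def exact_subset_sum(S, t):
    """Largest subset sum of S that is at most t.
    The intermediate sets P_i may grow exponentially large."""
    P = {0}                                    # sums realizable from a prefix of S
    for x in S:
        P = {s for s in P | {p + x for p in P} if s <= t}
    return max(P)
```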

Thus, if we would like to turn ExactSubsetSum into a faster algorithm, we need to somehow make the
lists P_i smaller. This would be done by removing numbers which are very close together.

Definition 8.2.6. For two positive real numbers z ≤ y, the number y is a δ-approximation to z if
y/(1 + δ) ≤ z ≤ y.

The procedure Trim, which trims a list L′ so that it removes close numbers, is depicted below.

    Trim(L′, δ)
        // L′ = ⟨y_1 . . . y_m⟩: list of numbers,
        // sorted increasingly: y_i ≤ y_{i+1}, for i = 1, . . . , m − 1.
        curr ← y_1
        L_out ← {y_1}
        for i = 2 . . . m do
            if y_i > curr · (1 + δ)
                Append y_i to L_out
                curr ← y_i
        return L_out

Observation 8.2.7. If x ∈ L′ then there exists a number y ∈ L_out such that y ≤ x ≤ y(1 + δ), where
L_out ← Trim(L′, δ).

We can now modify ExactSubsetSum to use Trim to keep the candidate list shorter. The resulting
algorithm ApproxSubsetSum is depicted below. Note that computing E_i requires merging two sorted lists,
which can be done in time linear in the size of the lists (i.e., we can keep all the lists sorted, without sorting
the lists repeatedly).

    ApproxSubsetSum(S, t)
        // Assume S = {x_1, . . . , x_n}, where x_1 ≤ x_2 ≤ . . . ≤ x_n
        n ← |S|, L_0 ← {0}, δ = ε/2n
        for i = 1 . . . n do
            E_i ← L_{i−1} ∪ (L_{i−1} + x_i)
            L_i ← Trim(E_i, δ)
            Remove from L_i all elements > t.
        return largest element in L_n

Let E_i be the list generated by the algorithm in the ith iteration, and let P_i be the list of numbers without
any trimming (i.e., the set generated by the ExactSubsetSum algorithm) in the ith iteration.
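Putting Trim and the main loop together gives the following Python sketch (a minimal sketch; Python lists play the role of the sorted lists, and the merge is done with `sorted` for brevity rather than a linear-time merge):

```python
def trim(L, delta):
    """Thin out an increasingly sorted list: keep a value only if it exceeds
    the last kept value by more than a (1 + delta) factor."""
    out = [L[0]]
    for y in L[1:]:
        if y > out[-1] * (1 + delta):
            out.append(y)
    return out

def approx_subset_sum(S, t, eps):
    """FPTAS sketch: returns z with opt / (1 + eps) <= z <= opt <= t."""
    n = len(S)
    delta = eps / (2 * n)
    L = [0]
    for x in sorted(S):                           # x_1 <= x_2 <= ... <= x_n
        E = sorted(set(L) | {v + x for v in L})   # merge L_{i-1} and L_{i-1} + x_i
        L = [v for v in trim(E, delta) if v <= t]
    return max(L)
```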

Claim 8.2.8. For any x ∈ P_i there exists y ∈ L_i such that y ≤ x ≤ (1 + δ)^i y.

Proof: If x ∈ P_1 the claim follows by Observation 8.2.7 above. Otherwise, if x ∈ P_{i−1}, then, by induction,
there is y′ ∈ L_{i−1} such that y′ ≤ x ≤ (1 + δ)^{i−1} y′. Observation 8.2.7 implies that there exists y ∈ L_i such
that y ≤ y′ ≤ (1 + δ)y. As such,

    y ≤ y′ ≤ x ≤ (1 + δ)^{i−1} y′ ≤ (1 + δ)^i y,

as required.
The other possibility is that x ∈ P_i \ P_{i−1}. But then x = α + x_i, for some α ∈ P_{i−1}. By induction, there
exists α′ ∈ L_{i−1} such that

    α′ ≤ α ≤ (1 + δ)^{i−1} α′.

Thus, α′ + x_i ∈ E_i and, by Observation 8.2.7, there is an x′ ∈ L_i such that

    x′ ≤ α′ + x_i ≤ (1 + δ)x′.

Thus,

    x′ ≤ α′ + x_i ≤ α + x_i = x ≤ (1 + δ)^{i−1} α′ + x_i ≤ (1 + δ)^{i−1} (α′ + x_i) ≤ (1 + δ)^i x′.

Namely, for any x ∈ P_i \ P_{i−1}, there exists x′ ∈ L_i such that x′ ≤ x ≤ (1 + δ)^i x′.

8.2.2.1. Bounding the running time of ApproxSubsetSum


We need the following two easy technical lemmas. We include their proofs here only for the sake of completeness.

Lemma 8.2.9. For x ∈ [0, 1], it holds exp(x/2) ≤ (1 + x).

Proof: Let f(x) = exp(x/2) and g(x) = 1 + x. We have f′(x) = exp(x/2)/2 and g′(x) = 1. As such,

    f′(x) = exp(x/2)/2 ≤ exp(1/2)/2 ≤ 1 = g′(x),    for x ∈ [0, 1].

Now, f(0) = g(0) = 1, which immediately implies the claim.
Lemma 8.2.10. For 0 < δ < 1 and x ≥ 1, we have log_{1+δ} x ≤ (2 ln x)/δ = O((ln x)/δ).

Proof: We have, by Lemma 8.2.9, that log_{1+δ} x = (ln x)/ln(1 + δ) ≤ (ln x)/ln(exp(δ/2)) = (2 ln x)/δ.

Observation 8.2.11. In a list generated by Trim, for any number x, there are no two numbers in the trimmed
list between x and (1 + δ)x.

Lemma 8.2.12. We have |L_i| = O((n²/ε) log n), for i = 1, . . . , n.

Proof: The set L_{i−1} + x_i is a set of numbers between x_i and i·x_i, because x_i is larger than x_1, . . . , x_{i−1}
and L_{i−1} contains subset sums of at most i − 1 numbers, each one of them smaller than x_i. As such, the
number of different values in this range, stored in the list L_i after trimming, is at most

    log_{1+δ}(i·x_i / x_i) = O((ln i)/δ) = O((ln n)/δ),

by Lemma 8.2.10. Thus, as δ = ε/2n, we have

    |L_i| ≤ |L_{i−1}| + O((ln n)/δ) ≤ |L_{i−1}| + O((n ln n)/ε) = O((n² log n)/ε).

 
Lemma 8.2.13. The running time of ApproxSubsetSum is O((n³/ε) log n).

Proof: Clearly, the running time of ApproxSubsetSum is dominated by the total length of the lists L_1, . . . , L_n it
creates. Lemma 8.2.12 implies that Σ_i |L_i| = O((n³/ε) log n). The running time of Trim is proportional to the size
of the lists, implying the claimed running time.

8.2.2.2. The result

Theorem 8.2.14. ApproxSubsetSum returns a number u ≤ t, such that

    γopt/(1 + ε) ≤ u ≤ γopt ≤ t,

where γopt is the optimal solution (i.e., the largest realizable subset sum smaller than t).
The running time of ApproxSubsetSum is O((n³/ε) log n).

Proof: The running time bound is by Lemma 8.2.13.
As for the other claim, consider the optimal solution opt ∈ P_n. By Claim 8.2.8, there exists z ∈ L_n such that
z ≤ opt ≤ (1 + δ)^n z. However,

    (1 + δ)^n = (1 + ε/2n)^n ≤ exp(ε/2) ≤ 1 + ε,

since 1 + x ≤ e^x for x ≥ 0, and by Lemma 8.2.9. Thus, opt/(1 + ε) ≤ z ≤ opt ≤ t, implying that the output
of ApproxSubsetSum is within the required range.

8.3. Approximate Bin Packing


Consider the following problem.

Min Bin Packing
Instance: s_1 . . . s_n – n numbers in [0, 1].
Question: What is the minimum number of unit bins needed to store all the numbers?

Bin Packing is NP-Complete because one can reduce Partition to it. It is natural to ask how well one can
approximate the optimal solution to Bin Packing.
One such algorithm is next fit. Here, we go over the numbers one by one, and put a number in the current
bin if that bin can contain it. Otherwise, we create a new bin and put the number in this bin. Clearly, we need
at least

    ⌈S⌉ bins, where S = Σ_{i=1}^n s_i.

Every two consecutive bins contain numbers that add up to more than 1, since otherwise we would not have
created the second bin. As such, the number of bins used is at most 2⌈S⌉, and the next fit algorithm for bin
packing achieves a 2⌈S⌉/⌈S⌉ = 2 approximation.
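Next fit in Python (a minimal sketch; the sizes are floats in [0, 1]):

```python
def next_fit(sizes):
    """Next fit: keep one open bin; open a fresh bin when an item does not fit."""
    if not sizes:
        return []
    bins = [[]]
    room = 1.0                      # remaining capacity of the open bin
    for s in sizes:
        if s > room:                # close the current bin, open a new one
            bins.append([])
            room = 1.0
        bins[-1].append(s)
        room -= s
    return bins
```

Note that next fit never revisits a closed bin, which is what makes the "two consecutive bins sum to more than 1" argument work.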
A better strategy is to sort the numbers from largest to smallest and insert them in this order, where in
each stage, we scan all current bins, and check if we can insert the current number into one of those bins. If we
cannot, we create a new bin for this number. This is known as first fit decreasing. We state the approximation
ratio for this algorithm without proof.
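First fit decreasing can be sketched as follows (a minimal quadratic-time sketch; the bin scan can be sped up with a suitable tree, but that is beside the point here):

```python
def first_fit_decreasing(sizes):
    """Sort items largest first; place each in the first bin with enough room."""
    room = []       # room[i] = remaining capacity of bin i
    packing = []    # packing[i] = items placed in bin i
    for s in sorted(sizes, reverse=True):
        for i in range(len(room)):
            if s <= room[i]:
                room[i] -= s
                packing[i].append(s)
                break
        else:                       # no existing bin fits: open a new one
            room.append(1.0 - s)
            packing.append([s])
    return packing
```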
Theorem 8.3.1 ([DLHT13]). First fit decreasing is a 11/9-approximation to Min Bin Packing. More
precisely, for any instance I of the problem, one has

    FFD(I) ≤ (11/9) opt(I) + 2/3,

and this is tight in the worst case. Here FFD(I) and opt(I) are the number of bins used by the first-fit decreasing
algorithm and the optimal solution, respectively.

Remark 8.3.2. Note that if opt(I) = 2, then the above bound is FFD(I) ≤ (11/9)·2 + 2/3 = 28/9 = 3 1/9, which
means that in this case this approach could yield a solution with three bins, which is not exciting.

The above paper is almost 50 pages long, and is not easy. The coefficient 11/9 was proved by David S.
Johnson in his PhD thesis in 1973 (who also authored [GJ90]), but the exact value of the additive constant was
only settled by [DLHT13].

Remark 8.3.3. Note, that the above algorithm is not a multiplicative approximation (note the +2/3 term). In
particular, getting a 3/2-approximation is hard because of the reduction from Partition – there the decision
boils down to whether the instance generated from partition requires two bins or three bins. As such, any
multiplicative approximation better than 3/2 is impossible unless P = NP.

8.4. Bibliographical notes


A 2-approximation for the k-center clustering in low-dimensional Euclidean space can be computed in
Θ(n log k) time [FG88]. In fact, it can be solved in linear time [Har04].

Part IV
Randomized algorithms

Chapter 9

Randomized Algorithms

9.1. Some Probability


Definition 9.1.1. (Informal.) A random variable is a measurable function from a probability space to (usually)
real numbers. It associates a value with each possible atomic event in the probability space.

Definition 9.1.2. The conditional probability of X given Y is

    P[X = x | Y = y] = P[(X = x) ∩ (Y = y)] / P[Y = y].

An equivalent and useful restatement of this is that

    P[(X = x) ∩ (Y = y)] = P[X = x | Y = y] · P[Y = y].

Definition 9.1.3. Two events X and Y are independent, if P[X = x ∩ Y = y] = P[X = x] · P[Y = y]. In particular,
if X and Y are independent, then

    P[X = x | Y = y] = P[X = x].

Definition 9.1.4. The expectation of a random variable X is the average value of this random variable. Formally,
if X has a finite (or countable) set of values, it is

    E[X] = Σ_x x · P[X = x],

where the summation goes over all the possible values of X.

One of the most powerful properties of expectation is that the expectation of a sum is the sum of expectations.

Lemma 9.1.5 (Linearity of expectation). For any two random variables X and Y, we have
E[X + Y] = E[X] + E[Y].

Proof: For the simplicity of exposition, assume that X and Y receive only integer values. We have that

    E[X + Y] = Σ_x Σ_y (x + y) · P[(X = x) ∩ (Y = y)]
             = Σ_x Σ_y x · P[(X = x) ∩ (Y = y)] + Σ_x Σ_y y · P[(X = x) ∩ (Y = y)]
             = Σ_x x · Σ_y P[(X = x) ∩ (Y = y)] + Σ_y y · Σ_x P[(X = x) ∩ (Y = y)]
             = Σ_x x · P[X = x] + Σ_y y · P[Y = y]
             = E[X] + E[Y].

Another interesting function is the conditional expectation – that is, the expectation of a random
variable given some additional information.

Definition 9.1.6. Given random variables X and Y, the conditional expectation of X given Y is the quantity
E[X | Y]. Specifically, given the value y of the random variable Y, it is

    E[X | Y = y] = Σ_x x · P[X = x | Y = y].

Note that for a random variable X, the expectation E[X] is a number. On the other hand, the conditional
expectation f(y) = E[X | Y = y] is a function. The key insight about conditional expectation is the following.

Lemma 9.1.7. For any two random variables X and Y (not necessarily independent), we have that
E[X] = E[E[X | Y]].

Proof: We use the definitions carefully:

    E[E[X | Y]] = E_y[E[X | Y = y]] = E_y[Σ_x x · P[X = x | Y = y]]
                = Σ_y P[Y = y] · (Σ_x x · P[X = x | Y = y])
                = Σ_y P[Y = y] · (Σ_x x · P[(X = x) ∩ (Y = y)] / P[Y = y])
                = Σ_y Σ_x x · P[(X = x) ∩ (Y = y)] = Σ_x Σ_y x · P[(X = x) ∩ (Y = y)]
                = Σ_x x · (Σ_y P[(X = x) ∩ (Y = y)]) = Σ_x x · P[X = x] = E[X].

9.2. Sorting Nuts and Bolts


Problem 9.2.1 (Sorting Nuts and Bolts). You are given a set of n nuts and n bolts. Every nut has a matching
bolt, and all the n pairs of nuts and bolts have different sizes. Unfortunately, you get the nuts and bolts
separated from each other and you have to match the nuts to the bolts. Furthermore, given a nut and a bolt,
all you can do is try to match one bolt against a nut (i.e., you cannot compare two nuts to each other, or
two bolts to each other).
When comparing a nut to a bolt, either they match, or one is smaller than the other (and you know the
relationship after the comparison).
How to match the n nuts to the n bolts quickly? Namely, while performing a small number of comparisons.

The naive algorithm is of course to compare each nut to each bolt, and match them together. This would
require a quadratic number of comparisons. Another option is to sort the nuts by size, and the bolts by size,
and then “merge” the two ordered sets, matching them by size. The only problem is that we cannot sort only
the nuts, or only the bolts, since we cannot compare them to each other. Instead, we sort the two sets
simultaneously, by simulating QuickSort. The resulting algorithm is depicted below.

    MatchNutsAndBolts(N: nuts, B: bolts)
        Pick a random nut n_pivot from N
        Find its matching bolt b_pivot in B
        B_L ← All bolts in B smaller than n_pivot
        N_L ← All nuts in N smaller than b_pivot
        B_R ← All bolts in B larger than n_pivot
        N_R ← All nuts in N larger than b_pivot
        MatchNutsAndBolts(N_R, B_R)
        MatchNutsAndBolts(N_L, B_L)
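A Python sketch, where nut and bolt sizes are modeled as numbers so that comparing a nut to a bolt is an ordinary numeric comparison (this modeling is an assumption of the sketch):

```python
import random

def match_nuts_and_bolts(nuts, bolts):
    """Return a list of matched (nut, bolt) pairs, QuickSort style.
    Every comparison below is nut-versus-bolt, never nut-nut or bolt-bolt."""
    if not nuts:
        return []
    pivot_nut = random.choice(nuts)
    pivot_bolt = next(b for b in bolts if b == pivot_nut)   # find the matching bolt
    smaller_bolts = [b for b in bolts if b < pivot_nut]
    larger_bolts = [b for b in bolts if b > pivot_nut]
    smaller_nuts = [n for n in nuts if n < pivot_bolt]
    larger_nuts = [n for n in nuts if n > pivot_bolt]
    return ([(pivot_nut, pivot_bolt)]
            + match_nuts_and_bolts(smaller_nuts, smaller_bolts)
            + match_nuts_and_bolts(larger_nuts, larger_bolts))
```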

9.2.1. Running time analysis


Definition 9.2.2. Let RT denote the random variable which is the running time of the algorithm. Note, that the
running time is a random variable as it might be different between different executions on the same input.

Definition 9.2.3. For a randomized algorithm, we can speak about the expected running time. Namely, we are
interested in bounding the quantity E[RT] for the worst input.

Definition 9.2.4. The expected running-time of a randomized algorithm for input of size n is

    T(n) = max_{U is an input of size n} E[RT(U)],

where RT(U) is the running time of the algorithm for the input U.

Definition 9.2.5. The rank of an element x in a set S, denoted by rank(x), is the number of elements in S of size
smaller or equal to x. Namely, it is the location of x in the sorted list of the elements of S.

Theorem 9.2.6. The expected running time of MatchNutsAndBolts (and thus also of QuickSort) is T(n) =
O(n log n), where n is the number of nuts and bolts. The worst case running time of this algorithm is O(n²).

Proof: Clearly, we have that P[rank(n_pivot) = k] = 1/n. Furthermore, conditioning on the rank of the
pivot, we have

    T(n) = E_{k = rank(n_pivot)}[O(n) + T(k − 1) + T(n − k)]
         = O(n) + Σ_{k=1}^n P[rank(n_pivot) = k] · (T(k − 1) + T(n − k))
         = O(n) + Σ_{k=1}^n (1/n) · (T(k − 1) + T(n − k)),

by the definition of expectation. One can verify that the solution to the recurrence T(n) = O(n) +
Σ_{k=1}^n (1/n) · (T(k − 1) + T(n − k)) is O(n log n).

9.2.1.1. Alternative incorrect solution
The algorithm MatchNutsAndBolts is lucky if n/4 ≤ rank(n_pivot) ≤ (3/4)n. Thus, P[“lucky”] = 1/2. Intuitively, for
the algorithm to be fast, we want the split to be as balanced as possible. The less balanced the cut is, the worse
the expected running time. As such, the “worst” lucky position is when rank(n_pivot) = n/4, and we have that

    T(n) ≤ O(n) + P[“lucky”] · (T(n/4) + T(3n/4)) + P[“unlucky”] · T(n).

Namely, T(n) = O(n) + (1/2)·(T(n/4) + T(3n/4)) + (1/2)·T(n). Rewriting, we get the recurrence
T(n) = O(n) + T(n/4) + T((3/4)n), and its solution is O(n log n).
While this is a very intuitive and elegant argument that bounds the running time of QuickSort, it is also
incomplete. The interested reader should try and make this argument complete. After completion, the argument
is as involved as the previous one. Nevertheless, this argumentation gives a good back-of-the-envelope
analysis for randomized algorithms, which can be applied in a lot of cases.

9.2.2. What are randomized algorithms?


Randomized algorithms are algorithms that use random numbers (retrieved usually from some unbiased source
of randomness [say, a library function that returns the result of a random coin flip]) to make decisions during
the execution of the algorithm. The running time becomes a random variable. Analyzing the algorithm now
boils down to analyzing the behavior of the random variable RT(n), where n denotes the size of the input. In
particular, the expected running time E[RT(n)] is a quantity that we would be interested in.
It is useful to compare the expected running time of a randomized algorithm, which is

    T(n) = max_{U is an input of size n} E[RT(U)],

to the worst case running time of a deterministic (i.e., not randomized) algorithm, which is

    T(n) = max_{U is an input of size n} RT(U).

Caveat Emptor:¬ Note that a randomized algorithm might have exponential running time in the worst case
(or even unbounded) while having good expected running time. For example, consider the algorithm FlipCoins
depicted below.

    FlipCoins
        while RandBit = 1 do
            nothing

The running time of FlipCoins is a geometric random variable with parameter 1/2; as such, we have that
E[RT(FlipCoins)] = 2 = O(1). However, FlipCoins can run forever if it always gets 1 from the RandBit
function.
This is of course a ludicrous argument. Indeed, the probability that FlipCoins runs for long decreases very
quickly as the number of steps increases. It can happen that it runs for long, but it is extremely unlikely.

Definition 9.2.7. The running time of a randomized algorithm Alg is O(f(n)) with high probability if

    P[RT(Alg(n)) ≥ c · f(n)] = o(1).

Namely, the probability of the algorithm taking more than O(f(n)) time decreases to 0 as n goes to infinity. In
our discussion, we will use the following (considerably more restrictive) definition, that requires that

    P[RT(Alg(n)) ≥ c · f(n)] ≤ 1/n^d,

where c and d are appropriate constants. For technical reasons, we also require that E[RT(Alg(n))] = O(f(n)).
¬ Caveat Emptor - let the buyer beware (i.e., one buys at one’s own risk)

9.3. Analyzing QuickSort
The previous analysis also works for QuickSort. However, there is an alternative analysis which is also very
interesting and elegant. Let a_1, . . . , a_n be the n given numbers (in sorted order – as they appear in the output).
It is enough to bound the number of comparisons performed by QuickSort to bound its running time, as
can be easily verified. Observe that two specific elements are compared to each other by QuickSort at most
once, because QuickSort performs only comparisons against the pivot, and after these comparisons happen, the
pivot is not passed to the two recursive subproblems.
Let X_ij be an indicator variable that is 1 if QuickSort compared a_i to a_j in the current execution, and zero
otherwise. The number of comparisons performed by QuickSort is exactly Z = Σ_{i<j} X_ij.
Í

Observation 9.3.1. The element ai is compared to a j iff one of them is picked to be the pivot and they are
still in the same subproblem.

Also, we have that µ = E[X_ij] = P[X_ij = 1]. To quantify this probability, observe that if the pivot is smaller
than a_i or larger than a_j then the subproblem still contains the block of elements a_i, . . . , a_j. Thus, we have that

    µ = P[a_i or a_j is the first pivot ∈ {a_i, . . . , a_j}] = 2/(j − i + 1).

Another (and hopefully more intuitive) explanation for the above phenomenon is the following: Imagine that,
before running QuickSort, we choose for every element a random priority, which is a real number in the range
[0, 1]. Now, we reimplement QuickSort such that it always picks the element with the lowest random priority (in
the given subproblem) to be the pivot. One can verify that this variant and the standard implementation have
the same running time. Now, a_i gets compared to a_j if and only if all the elements a_{i+1}, . . . , a_{j−1} have random
priority larger than both the random priority of a_i and the random priority of a_j. But the probability that one
of two specific elements has the lowest random priority out of j − i + 1 elements is 2 · 1/(j − i + 1), as claimed.
Thus, the running time of QuickSort is

    E[RT(n)] = E[Σ_{i<j} X_ij] = Σ_{i<j} E[X_ij] = Σ_{i<j} 2/(j − i + 1)
             = 2 Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} 1/(j − i + 1)
             = 2 Σ_{i=1}^{n−1} Σ_{∆=2}^{n−i+1} 1/∆ ≤ 2 Σ_{i=1}^{n−1} Σ_{∆=1}^{n} 1/∆ ≤ 2 Σ_{i=1}^{n−1} H_n ≤ 2nH_n,

by linearity of expectation, where H_n = Σ_{i=1}^n 1/i ≤ ln n + 1 is the nth harmonic number.
As we will see in the near future, the running time of QuickSort is O(n log n) with high probability. We need
some more tools before we can show that.
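The 2nH_n bound is easy to check empirically. The sketch below counts the element-versus-pivot comparisons of a randomized QuickSort (a minimal sketch assuming distinct elements):

```python
import random

def quicksort_comparisons(a):
    """Run randomized QuickSort on the list a (distinct elements assumed)
    and return the number of element-vs-pivot comparisons performed."""
    if len(a) <= 1:
        return 0
    pivot = random.choice(a)
    left = [x for x in a if x < pivot]
    right = [x for x in a if x > pivot]
    # every non-pivot element is compared to the pivot exactly once
    return (len(a) - 1) + quicksort_comparisons(left) + quicksort_comparisons(right)
```

For n = 200 the average over a few runs comes out comfortably below 2nH_n ≈ 2350.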

9.4. QuickSelect – median selection in linear time


Consider the following problem: given a set X of n numbers, and a parameter k, output the kth smallest number
(i.e., the number with rank k in X). This can easily be done by modifying QuickSort to perform only
one recursive call. See Figure 9.1 for pseudo-code of the resulting algorithm.
Intuitively, at each iteration of QuickSelect the input size shrinks by a constant factor, leading to a linear
time algorithm.

Theorem 9.4.1. Given a set X of n numbers, and any integer k, the expected running time of QuickSelect(X, k)
is O(n).

    QuickSelect(X, k)
        // Input: X = {x_1, . . . , x_n} numbers, k.
        // Assume x_1, . . . , x_n are all distinct.
        // Task: Return kth smallest number in X.
        y ← random element of X.
        r ← rank of y in X.
        if r = k then return y
        X_< ← all elements in X smaller than y
        X_> ← all elements in X larger than y
        // By assumption |X_<| + |X_>| + 1 = |X|.
        if r < k then
            return QuickSelect(X_>, k − r)
        else
            return QuickSelect(X_<, k)

Figure 9.1: QuickSelect pseudo-code.
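A direct Python transcription of the algorithm (a sketch; the elements are assumed distinct, as in the pseudo-code):

```python
import random

def quick_select(X, k):
    """Return the k-th smallest element of X (k is 1-based); expected O(n) time.
    Assumes the elements of X are distinct."""
    y = random.choice(X)
    smaller = [x for x in X if x < y]
    larger = [x for x in X if x > y]
    r = len(smaller) + 1            # rank of y in X
    if r == k:
        return y
    if r < k:
        return quick_select(larger, k - r)
    return quick_select(smaller, k)
```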

Proof: Let X_1 = X, and let X_i be the set of numbers in the ith level of the recursion. Let y_i and r_i be the random
element and its rank in X_i, respectively, in the ith iteration of the algorithm. Finally, let n_i = |X_i|. Observe
that the probability that the pivot y_i is in the “middle” of its subproblem is

    α = P[n_i/4 ≤ r_i ≤ (3/4)n_i] ≥ 1/2,

and if this happens then

    n_{i+1} ≤ max(r_i − 1, n_i − r_i) ≤ (3/4)n_i.

We conclude that

    E[n_{i+1} | n_i] ≤ P[y_i in the middle] · (3/4)n_i + P[y_i not in the middle] · n_i
                    ≤ α·(3/4)n_i + (1 − α)·n_i = n_i(1 − α/4) ≤ n_i(1 − (1/2)/4) = (7/8)n_i.

Now, we have that

    m_{i+1} = E[n_{i+1}] = E[E[n_{i+1} | n_i]] ≤ E[(7/8)n_i] = (7/8)·E[n_i] = (7/8)m_i
            ≤ (7/8)^i m_1 = (7/8)^i n,

since for any two random variables we have that E[X] = E[E[X | Y]]. In particular, the expected running time
of QuickSelect is proportional to

    E[Σ_i n_i] = Σ_i E[n_i] ≤ Σ_i m_i ≤ Σ_i (7/8)^{i−1} n = O(n),

as desired.

Chapter 10

Randomized Algorithms II

10.1. QuickSort and Treaps with High Probability


You must be asking yourself what treaps are. For the answer, see Section 10.3.
One can think about QuickSort as playing a game in rounds. Every round, QuickSort picks a pivot, splits
the problem into two subproblems, and continues playing the game recursively on both subproblems.
If we track a single element in the input, we see a sequence of rounds that involve this element. The game
ends when this element finds itself alone in the round (i.e., the subproblem is to sort a single element).
Thus, to show that QuickSort takes O(n log n) time, it is enough to show that every element in the input
participates in at most 32 ln n rounds with high enough probability.
Indeed, let X_i be the event that the ith element participates in more than 32 ln n rounds.
Let C_QS be the number of comparisons performed by QuickSort. A comparison between a pivot and an
element will always be charged to the element. As such, the number of comparisons overall performed by
QuickSort is bounded by Σ_i r_i, where r_i is the number of rounds the ith element participated in (the last
round, where it was a pivot, is ignored). We have that

    α = P[C_QS ≥ 32n ln n] ≤ P[∪_i X_i] ≤ Σ_{i=1}^n P[X_i].

Here, we used the union bound¬, that states that for any two events A and B, we have that P[A ∪ B] ≤
P[A] + P[B]. Assume, for the time being, that P[X_i] ≤ 1/n³. This implies that

    α ≤ Σ_{i=1}^n P[X_i] ≤ Σ_{i=1}^n 1/n³ = 1/n².

Namely, QuickSort performs at most 32n ln n comparisons with high probability. It follows that QuickSort
runs in O(n log n) time, with high probability, since the running time of QuickSort is proportional to the number
of comparisons it performs.
To this end, we need to prove that P[X_i] ≤ 1/n³.

10.1.1. Proving that an element participates in a small number of rounds


Consider a run of QuickSort for an input made out of n numbers. Consider a specific element x in this input,
and let S_1, S_2, . . . be the subsets of the input that are in the recursive calls that include the element x. Here S_j
is the set of numbers in the jth round (i.e., this is the recursive call at depth j which includes x among the
numbers it needs to sort).
The element x is considered to be lucky in the jth iteration if the call to QuickSort splits the
current set S_j into two parts, where both parts contain at most (3/4)|S_j| of the elements.
Let Y_j be an indicator variable which is 1 if and only if x is lucky in the jth round. Formally, Y_j = 1 if and
only if |S_j|/4 ≤ |S_{j+1}| ≤ 3|S_j|/4. By definition, we have that

    P[Y_j = 1] = 1/2.
¬ Also known as Boole’s inequality.

Furthermore, Y_1, Y_2, . . . , Y_m are all independent variables.
Note that x can participate in at most

    ρ = log_{4/3} n ≤ 3.5 ln n     (10.1)

successful rounds, since at each successful round the number of elements in the subproblem shrinks by at least
a factor of 3/4, and |S_1| = n. As such, if there are ρ successful rounds in the first k rounds, then
|S_k| ≤ (3/4)^ρ n ≤ 1.
Thus, the question of how many rounds x participates in boils down to how many coin flips one needs to
perform till one gets ρ heads. Of course, in expectation, we need to do this 2ρ times. But what if we want a
bound that holds with high probability – how many rounds are needed then?
In the following, we require the following lemma, which we will prove in Section 10.2.

Lemma 10.1.1. In a sequence of M coin flips, the probability that the number of ones is smaller than L ≤ M/4
is at most exp(−M/8).

To use Lemma 10.1.1, we set

M = 32 ln n ≥ 8ρ,

see Eq. (10.1). Let Y_j be the variable which is one if x is lucky in the jth level of the recursion, and zero otherwise.
We have that P[Y_j = 0] = P[Y_j = 1] = 1/2 and that Y_1, Y_2, . . . , Y_M are independent. By Lemma 10.1.1,
the probability that there are only ρ ≤ M/4 ones in Y_1, . . . , Y_M is smaller than

exp(−M/8) ≤ exp(−ρ) ≤ 1/n³.

That is, the probability that x participates in more than M recursive calls of QuickSort is at most 1/n³.
There are n input elements. Thus, the probability that the depth of the recursion in QuickSort exceeds 32 ln n
is smaller than n · (1/n³) = 1/n². We thus established the following result.

Theorem 10.1.2. With high probability (i.e., with probability ≥ 1 − 1/n²) the depth of the recursion of QuickSort is ≤ 32 ln n.
Thus, with high probability, the running time of QuickSort is O(n log n).
More generally, for any constant c, there exists a constant d, such that the probability that the QuickSort recursion
depth for any element exceeds d ln n is smaller than 1/n^c.
Specifically, for any t ≥ 1, we have that the probability that the recursion depth for any element exceeds t · d ln n
is smaller than 1/n^{t·c}.

Proof: Let us do the last part (but the reader is encouraged to skip this on first reading). Setting M = 32t ln n,
we get that for an element to have depth exceeding M, it must be that in M coin flips we get at most
h = 4 ln n heads. That is, if Y is the sum of the coin flips, where we get +1 for heads and −1 for tails, then Y needs
to be smaller than −(M − h) + h = −M + 2h. By symmetry, this is equal to the probability that Y ≥ ∆ = M − 2h.
By Theorem 10.2.3 below, the probability for that is

P[Y ≥ ∆] ≤ exp(−∆²/2M) = exp(−(M − 2h)²/(2M)) = exp(−(32t − 8)² ln² n / (64t ln n))
         = exp(−(4t − 1)² ln n / t) ≤ exp(−3t ln n) = 1/n^{3t},

since (4t − 1)² ≥ 3t² for t ≥ 1.

Of course, the same result holds for the algorithm MatchNutsAndBolts for matching nuts and bolts.

10.1.2. An alternative proof of the high probability of QuickSort
Consider the set T of the n items to be sorted, and consider a specific element t ∈ T. Let X_i be the size of the
subproblem in the ith level of the recursion that contains t. We know that X_0 = n, and

E[X_i | X_{i−1}] ≤ (1/2) · (3/4) X_{i−1} + (1/2) · X_{i−1} = (7/8) X_{i−1}.

Indeed, with probability 1/2 the pivot is in the middle of the subproblem; that is, its rank is between X_{i−1}/4 and
(3/4)X_{i−1} (and then the subproblem has size ≤ (3/4)X_{i−1}), and with probability 1/2 the subproblem might
not have shrunk significantly (i.e., we pretend it did not shrink at all).
Now, observe that for any two random variables X and Y, we have that E[X] = E_y[E[X | Y = y]], see Lemma 9.1.7p66.
As such, we have that

E[X_i] = E_y[E[X_i | X_{i−1} = y]] ≤ E_y[(7/8) y] = (7/8) E[X_{i−1}] ≤ (7/8)^i E[X_0] = (7/8)^i n.

In particular, consider M = 8 log_{8/7} n. We have that

µ = E[X_M] ≤ (7/8)^M n = (1/n^8) · n = 1/n^7.
Of course, t participates in more than M recursive calls if and only if X_M ≥ 1. However, by Markov's
inequality (Theorem 10.2.1), we have that

P[t participates in more than M recursive calls] ≤ P[X_M ≥ 1] ≤ E[X_M]/1 ≤ 1/n^7,

as desired. That is, we proved that the probability that any element of the input T participates in more than
M recursive calls is at most n · (1/n^7) = 1/n^6.
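The shrinkage recurrence E[X_i | X_{i−1}] ≤ (7/8)X_{i−1} can be checked by simulation. The sketch below (ours, not from the notes; the helper name and parameters are arbitrary) tracks the rank interval of the subproblem containing a fixed element under random pivot choices, and verifies that the average subproblem size at level i indeed stays below (7/8)^i n.

```python
import random

def subproblem_sizes(n, tracked, trials=2000, levels=10):
    """Average size of the recursive subproblem containing the element
    of rank `tracked`, per recursion level, under random pivots."""
    totals = [0.0] * (levels + 1)
    for _ in range(trials):
        lo, hi = 0, n - 1                 # current subproblem: ranks [lo, hi]
        for lvl in range(levels + 1):
            totals[lvl] += hi - lo + 1
            if lo == hi:
                continue                  # size-1 subproblem stays size 1
            p = random.randint(lo, hi)    # rank of a uniformly random pivot
            if p == tracked:
                lo = hi = tracked         # tracked element became the pivot
            elif p < tracked:
                lo = p + 1                # recurse into the right part
            else:
                hi = p - 1                # recurse into the left part
    return [t / trials for t in totals]

random.seed(1)
n = 1024
avg = subproblem_sizes(n, tracked=n // 2)
# the analysis shows E[X_i] <= (7/8)^i n; the true decay is even faster
for i, a in enumerate(avg):
    assert a <= (7 / 8) ** i * n + 1e-9
```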

10.2. Chernoff inequality


10.2.1. Preliminaries
Theorem 10.2.1 (Markov's inequality). For a non-negative random variable X, and any t > 0, we have:

P[X ≥ t] ≤ E[X]/t.

Proof: Assume that this is false, and there exists t_0 > 0 such that P[X ≥ t_0] > E[X]/t_0. However,

E[X] = Σ_x x · P[X = x] = Σ_{x < t_0} x · P[X = x] + Σ_{x ≥ t_0} x · P[X = x]
     ≥ 0 + t_0 · P[X ≥ t_0] > 0 + t_0 · (E[X]/t_0) = E[X],

a contradiction.
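Markov's inequality is easy to sanity-check numerically. In the sketch below (ours, not from the notes), the non-negative variable X is the sum of two fair dice — a hypothetical example with E[X] = 7 — and the empirical tail is compared against the E[X]/t bound.

```python
import random

# Empirical check of Markov's inequality: P[X >= t] <= E[X]/t for X >= 0.
random.seed(2)
samples = [random.randint(1, 6) + random.randint(1, 6) for _ in range(100_000)]
mean = sum(samples) / len(samples)       # close to E[X] = 7
for t in (8, 10, 12):
    tail = sum(1 for x in samples if x >= t) / len(samples)
    assert tail <= mean / t              # the bound holds (it is quite loose)
```

Note how loose the bound is: for t = 12 the true tail is 1/36 ≈ 0.028, while Markov only promises ≤ 7/12. This looseness is exactly why the Chernoff-style bounds below are worth the extra work.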

We remind the reader that two random variables X and Y are independent if for all x, y we have that

P[(X = x) ∩ (Y = y)] = P[X = x] · P[Y = y].

The following claim is easy to verify, and we omit the (easy) proof.
Claim 10.2.2. If X and Y are independent, then E[XY] = E[X] · E[Y].
If X and Y are independent, then Z = e^X and W = e^Y are also independent random variables.

10.2.2. Chernoff inequality
Theorem 10.2.3 (Chernoff inequality). Let X_1, . . . , X_n be n independent random variables, such that P[X_i = 1] =
P[X_i = −1] = 1/2, for i = 1, . . . , n. Let Y = Σ_{i=1}^{n} X_i. Then, for any ∆ > 0, we have

P[Y ≥ ∆] ≤ exp(−∆²/2n).

Proof: Clearly, for an arbitrary t > 0, to be specified shortly, we have

P[Y ≥ ∆] = P[tY ≥ t∆] = P[exp(tY) ≥ exp(t∆)] ≤ E[exp(tY)] / exp(t∆), (10.2)
where the first part follows since exp(·) preserves ordering, and the second part follows by Markov's inequality
(Theorem 10.2.1).
Observe that, by the definition of E[·] and by the Taylor expansion of exp(·), we have

E[exp(tX_i)] = (1/2) e^t + (1/2) e^{−t} = (e^t + e^{−t})/2
            = (1/2) (1 + t/1! + t²/2! + t³/3! + · · ·)
            + (1/2) (1 − t/1! + t²/2! − t³/3! + · · ·)
            = 1 + t²/2! + · · · + t^{2k}/(2k)! + · · · .

Now, (2k)! = k! (k+1)(k+2) · · · 2k ≥ k! 2^k, and thus

E[exp(tX_i)] = Σ_{i=0}^{∞} t^{2i}/(2i)! ≤ Σ_{i=0}^{∞} t^{2i}/(2^i i!) = Σ_{i=0}^{∞} (t²/2)^i / i! = exp(t²/2),

again, by the Taylor expansion of exp(·). Next, by the independence of the X_i s, we have

E[exp(tY)] = E[exp(Σ_i tX_i)] = E[Π_i exp(tX_i)] = Π_{i=1}^{n} E[exp(tX_i)]
          ≤ Π_{i=1}^{n} exp(t²/2) = exp(nt²/2).

We have, by Eq. (10.2), that

P[Y ≥ ∆] ≤ E[exp(tY)] / exp(t∆) ≤ exp(nt²/2) / exp(t∆) = exp(nt²/2 − t∆).

Next, we select the value of t that minimizes the right term in the above inequality. An easy calculation shows
that the right value is t = ∆/n. We conclude that

P[Y ≥ ∆] ≤ exp((n/2)(∆/n)² − (∆/n)∆) = exp(−∆²/2n).

Note, the above theorem states that

P[Y ≥ ∆] = Σ_{i=∆}^{n} P[Y = i] = Σ_{i = n/2 + ∆/2}^{n} (n choose i) / 2^n ≤ exp(−∆²/2n),

since Y = ∆ means that we got n/2 + ∆/2 (+1)s and n/2 − ∆/2 (−1)s.
By the symmetry of Y , we get the following corollary.

Corollary 10.2.4. Let X_1, . . . , X_n be n independent random variables, such that P[X_i = 1] = P[X_i = −1] = 1/2, for
i = 1, . . . , n. Let Y = Σ_{i=1}^{n} X_i. Then, for any ∆ > 0, we have

P[|Y| ≥ ∆] ≤ 2 exp(−∆²/2n).
By easy manipulation, we get the following result.

Corollary 10.2.5. Let X_1, . . . , X_n be n independent coin flips, such that P[X_i = 1] = P[X_i = 0] = 1/2, for
i = 1, . . . , n. Let Y = Σ_{i=1}^{n} X_i. Then, for any ∆ > 0, we have

P[n/2 − Y ≥ ∆] ≤ exp(−2∆²/n) and P[Y − n/2 ≥ ∆] ≤ exp(−2∆²/n).

In particular, we have P[|Y − n/2| ≥ ∆] ≤ 2 exp(−2∆²/n).

Proof: Transform X_i into the random variable Z_i = 2X_i − 1, and now use Theorem 10.2.3 on the new random
variables Z_1, . . . , Z_n. Indeed, Σ_i Z_i = 2Y − n, and thus P[Y − n/2 ≥ ∆] = P[Σ_i Z_i ≥ 2∆] ≤ exp(−(2∆)²/2n) = exp(−2∆²/n).

Lemma 10.1.1 (Restatement). In a sequence of M coin flips, the probability that the number of ones is smaller
than L ≤ M/4 is at most exp(−M/8).

Proof: Let Y = Σ_{i=1}^{M} X_i be the sum of the M coin flips. By the above corollary, we have:

P[Y ≤ L] = P[M/2 − Y ≥ M/2 − L] = P[M/2 − Y ≥ ∆],

where ∆ = M/2 − L ≥ M/4. Using the above Chernoff inequality, we get

P[Y ≤ L] ≤ exp(−2∆²/M) ≤ exp(−2(M/4)²/M) = exp(−M/8).

10.2.2.1. The Chernoff Bound — General Case


Here we present the Chernoff bound in a more general settings.
Problem 10.2.6. Let X_1, . . . , X_n be n independent Bernoulli trials, where

P[X_i = 1] = p_i and P[X_i = 0] = 1 − p_i,

and let

Y = Σ_i X_i and µ = E[Y].

Question: what is the probability that Y ≥ (1 + δ)µ?

Theorem 10.2.7 (Chernoff inequality). For any δ > 0,

P[Y > (1 + δ)µ] < ( e^δ / (1 + δ)^{1+δ} )^µ .

Or, in a more simplified form, for any δ ≤ 2e − 1,

P[Y > (1 + δ)µ] < exp(−µδ²/4), (10.3)

and

P[Y > (1 + δ)µ] < 2^{−µ(1+δ)},

for δ ≥ 2e − 1.

Theorem 10.2.8. Under the same assumptions as the theorem above, we have

P[Y < (1 − δ)µ] ≤ exp(−µδ²/2).

The proofs of these more general forms follow the proofs shown above, and are omitted. The interested
reader can find the proofs at:

http://www.uiuc.edu/~sariel/teach/2002/a/notes/07_chernoff.ps

10.3. Treaps
Anybody who has ever implemented a balanced binary search tree knows that it can be very painful. A natural question
is whether we can use randomization to get a simpler data-structure with good performance.

10.3.1. Construction
The key observation is that many of the data-structures that offer good performance for balanced binary search
trees do so by storing additional information that helps decide how to balance the tree. As such, the key idea is that
for every element x inserted into the data-structure, we randomly choose a priority p(x); that is, p(x) is chosen
uniformly at random in the range [0, 1].
So, for the set of elements X = {x_1, . . . , x_n}, with (random) priorities p(x_1), . . . , p(x_n), our purpose is to build
a binary tree which is "balanced". So, let us pick the element x_k with the lowest priority in X, and make it the
root of the tree. Now, we partition X in the natural way:
(A) L: the set of all the numbers smaller than x_k in X, and
(B) R: the set of all the numbers larger than x_k in X.
We can now build recursively the trees for L and R, and denote them by
T_L and T_R. We build the natural tree, by creating a node for x_k, having T_L as its
left child, and T_R as its right child.
We call the resulting tree a treap, as it is a tree over the elements, and a
heap over the priorities; that is, treap = tree + heap.
Lemma 10.3.1. Given n elements, the expected depth of a treap T defined over those elements is O(log n).
Furthermore, this holds with high probability; namely, the probability that the depth of the treap exceeds
c log n is smaller than δ = n^{−d}, where d is an arbitrary constant, and c is a constant that depends on d.­
Furthermore, the probability that T has depth larger than ct log n, for any t ≥ 1, is smaller than n^{−dt}.

Proof: Observe that every element has equal probability to be the root of the treap. Thus, the structure
of a treap is identical to the recursion tree of QuickSort. Indeed, imagine that instead of picking the pivot
uniformly at random, we pick the pivot to be the element with the lowest (random) priority. Clearly,
these two ways of choosing pivots are equivalent. As such, the claim follows immediately from our analysis of
the depth of the recursion tree of QuickSort, see Theorem 10.1.2p72.

10.3.2. Operations
The following innocent observation is going to be the key insight in implementing operations on treaps:

Observation 10.3.2. Given n distinct elements, and their (distinct) priorities, the treap storing them is
uniquely defined.
­ That is, if we want to decrease the probability of failure, that is δ, we need to increase c.

10.3.2.1. Insertion
Given an element x to be inserted into an existing treap T, insert it in the usual way into T (i.e., treat T as a
regular binary search tree). This takes O(height(T)) time. Now, x is a leaf in the treap. Set x's priority p(x) to some
random number in [0, 1]. Now, while the new tree is a valid search tree, it is not necessarily still a valid treap, as
x's priority might be smaller than its parent's. So, we need to fix the tree around x, so that the priority property
holds.
We call RotateUp(x) to do so. Specifically, if x's parent is y, and p(x) < p(y), we rotate x up so that it
becomes the parent of y. We repeatedly do this till x has a larger priority than its parent:

RotateUp(x)
    y ← parent(x)
    while p(y) > p(x) do
        if y.left_child = x then
            RotateRight(y)
        else
            RotateLeft(y)
        y ← parent(x)

The rotation operation takes constant time and plays around with the priorities, and, importantly, it
preserves the binary search tree order. (Figure: a rotate-right operation RotateRight(D) makes the left child of D
the new parent of D, while preserving the in-order sequence of the keys.)

RotateLeft is the same tree rewriting operation done in the other direction.
RotateLeft is the same tree rewriting operation done in the other direction.
At the end of this process, both the ordering property and the priority property hold. That is, we have a
valid treap that includes all the old elements, and the new element. By Observation 10.3.2, since the treap is
uniquely defined, we have updated the treap correctly. Since every rotation decreases the distance of x from
the root by one, it follows that an insertion takes O(height(T)) time.
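The insertion procedure above can be sketched in code. The following is a minimal Python rendering (ours, not the notes' implementation; names such as rotate_right are our choices), using a min-heap on the priorities, exactly as in the construction above.

```python
import random

class Node:
    __slots__ = ("key", "prio", "left", "right")
    def __init__(self, key):
        self.key = key
        self.prio = random.random()      # random priority in [0, 1]
        self.left = self.right = None

def rotate_right(y):                     # y's left child becomes the parent
    x = y.left
    y.left, x.right = x.right, y
    return x

def rotate_left(y):                      # y's right child becomes the parent
    x = y.right
    y.right, x.left = x.left, y
    return x

def insert(root, key):
    """BST insertion, then rotate the new node up while its priority is
    smaller than its parent's (RotateUp, done on the way out of the recursion)."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
        if root.left.prio < root.prio:
            root = rotate_right(root)
    else:
        root.right = insert(root.right, key)
        if root.right.prio < root.prio:
            root = rotate_left(root)
    return root

def inorder(t):
    return [] if t is None else inorder(t.left) + [t.key] + inorder(t.right)

def is_heap(t):                          # check the priority (heap) property
    return t is None or (
        (t.left is None or t.left.prio >= t.prio)
        and (t.right is None or t.right.prio >= t.prio)
        and is_heap(t.left) and is_heap(t.right))

random.seed(4)
root = None
for k in random.sample(range(1000), 200):
    root = insert(root, k)
assert inorder(root) == sorted(inorder(root))   # search-tree order holds
assert is_heap(root)                            # priority property holds
```

Doing the rotations on the way out of the recursion is equivalent to the iterative RotateUp loop above, and keeps the code short.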

10.3.2.2. Deletion
Deletion is just an insertion done in reverse. Specifically, to delete an element x from a treap T, set its priority
to +∞, and rotate it down until it becomes a leaf. The only tricky observation is that one should always rotate so
that the child with the lower priority becomes the new parent. Once x becomes a leaf, deleting it is trivial – just
set the pointer pointing to it in the tree to null.

10.3.2.3. Split
Given an element x stored in a treap T, we would like to split T into two treaps – one treap T_≤ for all the
elements smaller than or equal to x, and the other treap T_> for all the elements larger than x. To this end, we set
x's priority to −∞, and fix the priorities by rotating x up so that it becomes the root of the treap. The right child of x
is the treap T_>, and we disconnect it from T by setting x's right child pointer to null. Next, we restore x to its
real priority, and rotate it down to its natural location. The resulting treap is T_≤. This again takes time
proportional to the depth of the treap.

10.3.2.4. Meld
Given two treaps T_L and T_R, such that all the elements in T_L are smaller than all the elements in T_R, we would
like to merge them into a single treap. Find the largest element x stored in T_L (this is just the element stored at the
end of the path going only right from the root of the tree). Set x's priority to −∞, and rotate it up the treap so that
it becomes the root. Now, x, being the largest element in T_L, has no right child. Attach T_R as the right child of
x. Finally, restore x's priority to its original priority, and rotate it back down so that the priority properties hold.

10.3.3. Summary
Theorem 10.3.3. Let T be a treap, initialized to an empty treap, and undergoing a sequence of m = n^c insertions,
where c is some constant. The probability that the depth of the treap at any point in time exceeds
d log n is ≤ 1/n^f, where d is an arbitrary constant, and f is a constant that depends only on c and d.
In particular, a treap can handle insertions/deletions in O(log n) time with high probability.

Proof: The first part of the theorem implies that, with high probability, all these treaps have logarithmic
depth; this in turn implies that all operations take logarithmic time, as an operation on a treap takes time at most
proportional to the depth of the treap.
As for the first part, let T_1, . . . , T_m be the sequence of treaps, where T_i is the treap after the ith operation.
Similarly, let X_i be the set of elements stored in T_i. By Lemma 10.3.1, the probability that T_i has large depth
is tiny. Specifically, we have that

α_i = P[depth(T_i) > tc′ log n] = P[depth(T_i) > (c′t · log n / log |T_i|) · log |T_i|] ≤ 1/n^{t·c′},

as a tedious and boring but straightforward calculation shows. Picking t to be sufficiently large, we have that
the probability that the ith treap is too deep is smaller than 1/n^{f+c}. By the union bound, since there are n^c
treaps in this sequence of operations, it follows that the probability of any of these treaps being too deep is at
most 1/n^f, as desired.

10.4. Bibliographical Notes


Chernoff's inequality was a rediscovery of Bernstein's inequality, which was published in 1924 by Sergei Bernstein.
Treaps were invented by Seidel and Aragon [SA96]. Experimental evidence suggests that treaps perform
reasonably well in practice, despite their simplicity; see, for example, the comparison carried out by Cho and
Sahni [CS00]. Implementations of treaps are readily available. An old implementation I wrote in C is available
here: http://valis.cs.uiuc.edu/blog/?p=6060.

Chapter 11

Hashing

“I tried to read this book, Huckleberry Finn, to my grandchildren, but I couldn’t get past page six because the book is
fraught with the ‘n-word.’ And although they are the deepest-thinking, combat-ready eight- and ten-year-olds I know, I
knew my babies weren’t ready to comprehend Huckleberry Finn on its own merits. That’s why I took the liberty to rewrite
Mark Twain’s masterpiece. Where the repugnant ‘n-word’ occurs, I replaced it with ‘warrior’ and the word ‘slave’ with
‘dark-skinned volunteer.”’

Paul Beatty, The Sellout


Figure 11.1: Open hashing.

11.1. Introduction
We are interested here in the dictionary data-structure. The settings for such a data-structure:
(A) U: universe of keys with a total order: numbers, strings, etc.
(B) Data-structure to store a subset S ⊆ U.
(C) Operations:
(A) search/lookup: given x ∈ U, is x ∈ S?
(B) insert: given x ∉ S, add x to S.
(C) delete: given x ∈ S, delete x from S.
(D) Static structure: S given in advance or changes very infrequently; the main operations are lookups.
(E) Dynamic structure: S changes rapidly, so inserts and deletes are as important as lookups.

Common constructions for such data-structures include using a static sorted array, where a lookup is a
binary search. Alternatively, one might use a balanced search tree (e.g., a red-black tree). Operations like lookup,
insert and delete then take O(log |S|) time (comparisons).
Naturally, the above are potentially an "overkill", in the sense that sorting is unnecessary. In particular, the
universe U may not be (naturally) totally ordered, or the keys may correspond to large objects (images, graphs, etc.)
for which comparisons are expensive. Finally, we would like to improve the "average" performance of lookups to
O(1) time, even at the cost of extra space or errors with small probability: there are many applications requiring fast
lookups in networking, security, etc.

Hashing and Hash Tables. The hash-table data structure has an associated (hash) table/array T of size m
(the table size), and a hash function h : U → {0, . . . , m − 1}. An item x ∈ U hashes to slot h(x) in T.
Given a set S ⊆ U, in a perfect ideal situation, each element x ∈ S hashes to a distinct slot in T, and we
store x in the slot h(x). A lookup for an item y ∈ U then just checks whether T[h(y)] = y; this takes constant time.
Unfortunately, collisions are unavoidable, and there are several different techniques to handle them. Formally, two
items x ≠ y collide if h(x) = h(y).
A standard technique to handle collisions is to use chaining (aka open hashing). Here, we handle collisions
as follows:
(A) For each slot i, store all the items hashed to slot i in a linked list; T[i] points to this linked list.
(B) Lookup: to find whether y ∈ U is in T, scan the linked list at T[h(y)]. This takes time proportional to the size of the linked list.
Other techniques for handling collisions include associating with each element a list of locations where it can be stored (in
a certain order), and checking these locations in this order. Another useful technique is cuckoo hashing, which we
will discuss later on: every value has two possible locations. When inserting, insert in one of the locations;
otherwise, kick out the stored value to its other location, and repeat till stable. If there is no stability, rebuild the table.
The relevant questions when designing a hashing scheme include: (I) Does hashing give O(1) time per
operation for dictionaries? (II) What is the complexity of evaluating h on a given element? (III) The relative sizes of the
universe U and the set S to be stored. (IV) The size of the table relative to the size of S. (V) Worst-case vs. average-case
vs. randomized (expected) time? (VI) How do we choose h?
The load factor of the table T is the ratio n/m, where n = |S| is the number of elements being stored and
m = |T| is the size of the array being used. Typically n/m is a small constant smaller than 1.
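The chaining scheme above can be sketched in a few lines. The following is a minimal Python rendering (ours, not from the notes); Python's built-in hash stands in for a randomly chosen hash function from a universal family.

```python
class ChainedHashTable:
    """Minimal open-hashing (chaining) dictionary sketch."""

    def __init__(self, m=16):
        self.m = m
        self.table = [[] for _ in range(m)]   # one chain per slot

    def _slot(self, key):
        return hash(key) % self.m             # stand-in hash function

    def insert(self, key):
        chain = self.table[self._slot(key)]
        if key not in chain:
            chain.append(key)

    def lookup(self, key):
        # time proportional to the length of the chain at slot h(key)
        return key in self.table[self._slot(key)]

    def delete(self, key):
        chain = self.table[self._slot(key)]
        if key in chain:
            chain.remove(key)

t = ChainedHashTable()
for k in (3, 19, 35):        # all three collide (equal mod 16): one chain
    t.insert(k)
assert t.lookup(19) and not t.lookup(4)
t.delete(19)
assert not t.lookup(19)
```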

In the following, we assume that U (the universe the keys are taken from) is large – specifically, N = |U| ≫
m², where m is the size of the table. Consider a hash function h : U → {0, . . . , m − 1}. If we hash the N items of U to the m
slots, then by the pigeonhole principle, there is some i ∈ {0, . . . , m − 1} such that at least N/m ≥ m elements of U get
hashed to i. In particular, this implies that there is a set S ⊆ U, where |S| = m, such that all of S hashes to the same
slot. Oops.
Namely, for every hash function there is a bad set with many collisions.

Observation 11.1.1. Let H be the set of all functions from U to {0, . . . , m − 1}. The number of
functions in H is m^N. As such, specifying an arbitrary function in H would require log₂ |H| = O(N log m) bits.

As such, picking a truly random hash function requires many random bits, and furthermore, it is not even
clear how to evaluate it efficiently (which is the whole point of hashing).

Picking a hash function. Picking a good hash function in practice is a dark art involving many non-trivial
considerations and ideas. For parameters N = |U|, m = |T|, and n = |S|, we require the following:
(A) H is a family of hash functions: each function h ∈ H should be efficient to evaluate (that is, to compute
h(x)).
(B) h is chosen randomly from H (typically uniformly at random). Implicitly, we assume that H allows
efficient sampling.
(C) For any fixed set S ⊆ U of size n, the expected number of collisions for a function chosen
from H should be "small". Here the expectation is over the randomness in the choice of h.

11.2. Universal Hashing


We would like the hash function to have the following property – for any element x ∈ U, and a random h ∈ H,
the value h(x) should have a uniform distribution; that is, P[h(x) = i] = 1/m, for every 0 ≤ i < m. A somewhat
stronger property is that for any two distinct elements x, y ∈ U, and a random h ∈ H, the probability of a
collision between x and y should be at most 1/m; that is, P[h(x) = h(y)] ≤ 1/m.

Definition 11.2.1. A family H of hash functions is 2-universal if for all distinct x, y ∈ U, we have P[h(x) = h(y)] ≤
1/m.

Applying a 2-universal family hash function to a set of distinct numbers, results in a 2-wise independent
sequence of numbers.

Lemma 11.2.2. Let S be a set of n elements stored, using open hashing, in a hash table of size m,
where the hash function is picked from a 2-universal family. Then, the expected lookup time, for any
element x ∈ U, is O(n/m).

Proof: The number of elements colliding with x is ℓ(x) = Σ_{y∈S} D_y, where D_y = 1 ⟺ x and y collide under
the hash function h. As such, we have

E[ℓ(x)] = Σ_{y∈S} E[D_y] = Σ_{y∈S} P[h(x) = h(y)] ≤ Σ_{y∈S} 1/m = |S|/m = n/m.

Remark 11.2.3. The above analysis holds even if we perform a sequence of O(n) insertion/deletion operations.
Indeed, just repeat the analysis with the set of elements being all the elements encountered during these operations.
The worst-case bound is of course much worse – it is not hard to show that, in the worst case, the load of a
single hash table entry might be Ω(log n / log log n) (as we saw in the occupancy problem).

Rehashing, amortization, etc. The above assumed that the set S is fixed. If items are inserted and deleted,
then the hash table might become much worse. In particular, if |S| grows to more than cm, for some constant c,
then the hash table performance starts degrading. Furthermore, if many insertions and deletions happen, then the
initial random hash function is no longer random enough, and the above analysis no longer holds.
A standard solution is to rebuild the hash table periodically. We choose a new table size based on the current
number of elements in the table, and a new random hash function, and rehash the elements; we then discard the
old table and hash function. In particular, if |S| grows to more than twice the current table size, then we rebuild a new
hash table (choosing a new random hash function) with double the current number of elements. One can do a
similar shrinking operation if the set size falls below a quarter of the current hash table size.
If the working set S stays roughly the same, but more than c|S| operations were performed on the table, for some chosen constant
c (say 10), we rebuild as well.
We amortize the cost of rebuilding against the previously performed operations. Rebuilding ensures that the O(1) expected-time
analysis holds even when S changes. Hence, we get a dynamic dictionary data-structure with O(1) expected time per
lookup/insert/delete!

11.2.1. How to build a 2-universal family


11.2.1.1. On working modulo a prime
Definition 11.2.4. For a number n, let ℤ_n = {0, . . . , n − 1}.

For two integer numbers x and y, the quotient of x/y is x div y = ⌊x/y⌋. The remainder of x/y is
x mod y = x − y⌊x/y⌋. If x mod y = 0, then y divides x, denoted by y | x. We use α ≡ β (mod p) or α ≡_p β
to denote that α and β are congruent modulo p; that is, α mod p = β mod p – equivalently, p | (α − β).

Lemma 11.2.5. Let p be a prime number.

(A) For any α, β ∈ {1, . . . , p − 1}, we have that αβ ≢ 0 (mod p).
(B) For any α, β, i ∈ {1, . . . , p − 1}, such that α ≠ β, we have that αi ≢ βi (mod p).
(C) For any x ∈ {1, . . . , p − 1} there exists a unique y such that xy ≡ 1 (mod p). The number y is the inverse
of x, and is denoted by x^{−1} or 1/x.

Proof: (A) If αβ ≡ 0 (mod p), then p must divide αβ, as it divides 0. But α and β are smaller than p, and p is
prime. This implies that either p | α or p | β, which is impossible.
(B) Assume that α > β. Furthermore, for the sake of contradiction, assume that αi ≡ βi (mod p). But then
(α − β)i ≡ 0 (mod p), which is impossible, by (A).
(C) For any α ∈ {1, . . . , p − 1}, consider the set L_α = {α · 1 mod p, α · 2 mod p, . . . , α · (p − 1) mod p}. By
(A), zero is not in L_α, and by (B), L_α must contain p − 1 distinct values. It follows that L_α = {1, 2, . . . , p − 1}.
As such, there exists exactly one number y ∈ {1, . . . , p − 1}, such that αy ≡ 1 (mod p).
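Lemma 11.2.5 is easy to check computationally for a small prime. By Fermat's little theorem (a fact not proven in these notes), x^{p−2} mod p is exactly the inverse promised by part (C); the sketch below (ours) verifies this, along with the bijection argument used in the proof.

```python
p = 101  # a small prime (arbitrary choice)

def inverse_mod(x, p):
    # By Fermat's little theorem x^(p-1) = 1 (mod p) for x in {1,...,p-1},
    # so x^(p-2) mod p is the inverse promised by Lemma 11.2.5 (C).
    return pow(x, p - 2, p)          # fast modular exponentiation

# (C): every nonzero x has an inverse
for x in range(1, p):
    assert (x * inverse_mod(x, p)) % p == 1

# (A)+(B): for any nonzero a, the set L_a = {a*1, ..., a*(p-1)} mod p is
# exactly {1, ..., p-1}, which is the heart of the proof of (C)
a = 37
assert sorted((a * y) % p for y in range(1, p)) == list(range(1, p))
```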

Lemma 11.2.6. Consider a prime p, and any numbers x, y ∈ ℤ_p. If x ≠ y then, for any a, b ∈ ℤ_p, such that
a ≠ 0, we have ax + b ≢ ay + b (mod p).

Proof: Assume y > x (the other case is handled similarly). If ax + b ≡ ay + b (mod p), then a(y − x) ≡ 0 (mod p),
with a ≠ 0 and y − x ≠ 0. This is impossible by Lemma 11.2.5 (A), since a < p and 0 < y − x < p.

Lemma 11.2.7. Consider a prime p, and any numbers x, y ∈ ℤ_p. If x ≠ y then, for each pair of numbers
r, s ∈ ℤ_p = {0, 1, . . . , p − 1}, such that r ≠ s, there is exactly one choice of numbers a, b ∈ ℤ_p such that
ax + b (mod p) = r and ay + b (mod p) = s.

Proof: Solve the system of equations

ax + b ≡ r (mod p) and ay + b ≡ s (mod p).

We get a = (r − s)(x − y)^{−1} (mod p) and b = r − ax (mod p), using the inverse guaranteed by Lemma 11.2.5 (C).

11.2.1.2. Constructing a family of 2-universal hash functions
For parameters N = |U|, m = |T|, n = |S|, choose a prime number p ≥ N. Let

H = { h_{a,b} | a, b ∈ ℤ_p and a ≠ 0 },

where h_{a,b}(x) = ((ax + b) (mod p)) (mod m). Note that |H| = p(p − 1).

11.2.1.3. Analysis
Once we fix a and b, and we are given a value x, we compute the hash value of x in two stages:
(A) Compute: r ← (ax + b) (mod p).
(B) Fold: r 0 ← r (mod m)

Lemma 11.2.8. Assume that p is a prime, and 1 < m < p. The number of pairs (r, s) ∈ ℤ_p × ℤ_p, such that
r ≠ s, that are folded to the same number is ≤ p(p − 1)/m. Formally, the set of bad pairs

B = { (r, s) ∈ ℤ_p × ℤ_p | r ≠ s and r ≡_m s }

is of size at most p(p − 1)/m.

Proof: Consider a pair (x, y) ∈ {0, 1, . . . , p − 1}², such that x ≠ y. For a fixed x, there are at most ⌈p/m⌉ values
of y that fold into x. Indeed, x ≡_m y if and only if

y ∈ L(x) = {x + im | i is an integer} ∩ ℤ_p.

The size of L(x) is maximized when x = 0, and the number of such elements is at most ⌈p/m⌉ (note that, since p is
a prime and 1 < m < p, the ratio p/m is fractional). One of the numbers in L(x) is x itself. As such, we have that

|B| ≤ p(|L(x)| − 1) ≤ p(⌈p/m⌉ − 1) ≤ p(p − 1)/m,

since ⌈p/m⌉ − 1 ≤ (p − 1)/m ⟺ m⌈p/m⌉ − m ≤ p − 1 ⟺ m⌊p/m⌋ ≤ p − 1 ⟺ m⌊p/m⌋ < p, which is true
since p is a prime, and 1 < m < p.

Claim 11.2.9. For two distinct numbers x, y ∈ U, call a pair (a, b) bad if h_{a,b}(x) = h_{a,b}(y). The number of bad
pairs is ≤ p(p − 1)/m.

Proof: Let a, b ∈ ℤ_p such that a ≠ 0 and h_{a,b}(x) = h_{a,b}(y). Let

r = (ax + b) mod p and s = (ay + b) mod p.

By Lemma 11.2.6, we have that r ≠ s. As such, a collision happens only if r ≡ s (mod m). By Lemma 11.2.8, the
number of such pairs (r, s) is at most p(p − 1)/m. By Lemma 11.2.7, for each such pair (r, s), there is a unique
choice of (a, b) that maps x and y to r and s, respectively. As such, there are at most p(p − 1)/m bad pairs.

Theorem 11.2.10. The hash family H is a 2-universal hash family.

Proof: Fix two distinct numbers x, y ∈ U. We are interested in the probability that they collide if h is picked
randomly from H. By Claim 11.2.9 there are M ≤ p(p − 1)/m bad pairs that cause such a collision, and since
H contains N = p(p − 1) functions, it follows that the probability of a collision is M/N ≤ 1/m, which implies that H
is 2-universal.
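Since the family H is small, Theorem 11.2.10 can be verified by brute force for small parameters. The sketch below (ours; p = 97 and m = 10 are arbitrary choices) enumerates all p(p − 1) functions h_{a,b} and checks that at most a 1/m fraction of them collide on a fixed pair of distinct keys.

```python
p, m = 97, 10      # a prime p and a table size m, with 1 < m < p

def h(a, b, x):
    # the hash function h_{a,b}(x) = ((ax + b) mod p) mod m from the text
    return ((a * x + b) % p) % m

# Fix two distinct keys and count, over ALL valid (a, b) pairs, how many
# functions in the family make them collide.
x, y = 12, 55
pairs = [(a, b) for a in range(1, p) for b in range(p)]   # |H| = p(p-1)
collisions = sum(1 for a, b in pairs if h(a, b, x) == h(a, b, y))

# Theorem 11.2.10: at most a 1/m fraction of the functions collide on x, y.
assert collisions / len(pairs) <= 1 / m
```

The enumeration is exact, so this is not a statistical test: it directly exhibits the bound of Claim 11.2.9 for this choice of x and y.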

(A) (B) (C)

Figure 11.2: Explanation of the hashing scheme via figures.

11.2.1.4. Explanation via pictures


Consider a pair (x, y) ∈ ZZ2p , such that x , y. This pair (x, y) corresponds to a cell in the natural “grid” ZZ2p that
is off the main diagonal. See Figure 11.2
The mapping fa,b (x) = (ax + b) mod p, takes the pair (x, y), and maps it randomly and uniformly, to some
other pair x 0 = fa,b (x) and y 0 = fa,b (y) (where x 0, y 0 are again off the main diagonal).
Now consider the smaller grid ZZm × ZZm . The main diagonal of this subgrid is bad – it corresponds to a
collision. One can think about the last step, of computing ha,b (x) = fa,b (x) mod m, as tiling the larger grid, by
the smaller grid. in the natural way. Any diagonal that is in distance mi from the main diagonal get marked as
bad. At most 1/m fraction of the off diagonal cells get marked as bad. See Figure 11.2.
As such, the random mapping of (x, y) to (x 0, y 0) causes a collision only if we map the pair to a badly marked
pair, and the probability for that ≤ 1/m.

11.3. Perfect hashing


An interesting special case of hashing is the static case – given a set S of elements, we want to hash S so that
we can answer membership queries efficiently (i.e., a dictionary data-structure with no insertions). It is easy to
come up with a hashing scheme that is optimal as far as space is concerned.

11.3.1. Some easy calculations


The first observation is that if the hash table is quadratically large, then there is a good (constant) probability
of having no collisions at all (this is also the threshold for the birthday paradox).

Lemma 11.3.1. Let S ⊆ U be a set of n elements, and let H be a 2-universal family of hash functions into a
table of size m ≥ n². Then, with probability ≤ 1/2, there is a pair of elements of S that collide under a random
hash function h ∈ H.

Proof: For a pair x, y ∈ S, the probability they collide is at most 1/m, by definition. As such, by the union
bound, the probability of any collision is at most (n choose 2)/m = n(n − 1)/(2m) ≤ 1/2.

We now need a second moment bound on the sizes of the buckets.

Lemma 11.3.2. Let S ⊆ U be a set of n elements, and let H be a 2-universal family of hash functions
into a table of size m ≥ cn, where c is an arbitrary constant. Let h ∈ H be a random hash function, and
let X_i be the number of elements of S mapped to the ith bucket by h, for i = 0, . . . , m − 1. Then, we have

E[Σ_{j=0}^{m−1} X_j²] ≤ (1 + 2/c)n.

Proof: Let s_1, . . . , s_n be the n items in S, and let Z_{i,j} = 1 if h(s_i) = h(s_j), for i < j. Observe that
E[Z_{i,j}] = P[h(s_i) = h(s_j)] ≤ 1/m (this is the only place we use the property that H is 2-universal). In particular,
for α ∈ {0, . . . , m − 1}, let Z(α) be the set of all the variables Z_{i,j}, for i < j, such that Z_{i,j} = 1 and h(s_i) = h(s_j) = α.
If for some α we have that X_α = k, then there are k indices ℓ_1 < ℓ_2 < . . . < ℓ_k, such that h(s_{ℓ_1}) = · · · =
h(s_{ℓ_k}) = α. As such, z(α) = |Z(α)| = (k choose 2). In particular, we have

X_α² = k² = 2(k choose 2) + k = 2z(α) + X_α.

This implies that

Σ_{α=0}^{m−1} X_α² = Σ_{α=0}^{m−1} (2z(α) + X_α) = 2 Σ_{α=0}^{m−1} z(α) + Σ_{α=0}^{m−1} X_α = n + 2 Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} Z_{i,j},

since Σ_α X_α = n and Σ_α z(α) counts every colliding pair exactly once.
Now, by linearity of expectation, we have

E[Σ_{α=0}^{m−1} X_α²] = E[n + 2 Σ_{i<j} Z_{i,j}] = n + 2 Σ_{i<j} E[Z_{i,j}] ≤ n + 2 · (n choose 2) · (1/m)
                     ≤ n(1 + n/m) ≤ n(1 + 1/c) ≤ (1 + 2/c)n,

since m ≥ cn.

11.3.2. Construction of perfect hashing


Given a set S of n elements, we build an open hash table T of size, say, 2n. We use a random hash function
h that is 2-universal for this hash table, see Theorem 11.2.10. Next, we map the elements of S into the hash
table. Let S_j be the list of all the elements of S mapped to the jth bucket, and let X_j = |S_j|, for j = 0, . . . , 2n − 1.
We compute Y = Σ_j X_j². If Y > 6n, then we reject h, and resample a hash function h. We repeat this
process till success.
In the second stage, we build secondary hash tables for each bucket. Specifically, for j = 0, . . . , 2n − 1, if
the jth bucket contains X_j > 0 elements, then we construct a secondary hash table H_j to store the elements of
S_j, and this secondary hash table has size X_j², and again we use a random 2-universal hash function h_j for the
hashing of S_j into H_j. If any pair of elements of S_j collide under h_j, then we resample the hash function h_j, and
try again till success.
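The two-stage construction can be sketched directly in code. The family h(x) = ((ax + b) mod P) mod m used below is one standard 2-universal family and stands in for the family of Theorem 11.2.10 (an illustrative choice, as is the prime P); the table sizes and the 6n threshold follow the text:

```python
import random

P = 2**61 - 1  # a prime larger than any key hashed below (an assumption)

def rand_hash(m):
    # A random function from a 2-universal family into {0, ..., m-1}.
    a, b = random.randrange(1, P), random.randrange(P)
    return lambda x: ((a * x + b) % P) % m

def build_perfect(S):
    n = len(S)
    # Stage 1: top-level table of size 2n; resample h until the sum of
    # squared bucket sizes Y is at most 6n (expected O(1) attempts).
    while True:
        h = rand_hash(2 * n)
        buckets = [[] for _ in range(2 * n)]
        for x in S:
            buckets[h(x)].append(x)
        if sum(len(b) ** 2 for b in buckets) <= 6 * n:
            break
    # Stage 2: every nonempty bucket gets a quadratic-size secondary
    # table, resampled until it is collision free (Lemma 11.3.1).
    secondary = []
    for b in buckets:
        if not b:
            secondary.append((None, []))
            continue
        m2 = len(b) ** 2
        while True:
            h2 = rand_hash(m2)
            table = [None] * m2
            ok = True
            for x in b:
                k = h2(x)
                if table[k] is not None:
                    ok = False
                    break
                table[k] = x
            if ok:
                secondary.append((h2, table))
                break
    return h, secondary

def lookup(struct, x):
    # Two hash evaluations and one comparison: O(1) worst-case search.
    h, secondary = struct
    h2, table = secondary[h(x)]
    return h2 is not None and table[h2(x)] == x
```

Since the element itself is stored and compared on lookup, the structure never errs, unlike the Bloom filters of the next section.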

11.3.2.1. Analysis
Theorem 11.3.3. Given a (static) set S ⊆ U of n elements, the above scheme constructs, in expected linear
time, a two-level hash-table that can perform search queries in O(1) time. The resulting data-structure uses O(n)
space.

Proof: Given an element x ∈ U, we first compute j = h(x), and then k = h_j(x), and we can check whether the
element stored in the secondary hash table H_j at the entry k is indeed x. As such, the search time is O(1).
The more interesting issue is the construction time. Let X_j be the number of elements mapped to the jth
bucket, and let Y = Σ_j X_j². Observe that E[Y] = (1 + 2/2)n = 2n, by Lemma 11.3.2 (here, m = 2n and c = 2).
As such, by Markov's inequality, P[Y > 6n] ≤ 2n/(6n) = 1/3. In particular, picking a good top-level hash function
requires in expectation at most 1/(2/3) = 3/2 iterations. Thus the first stage takes O(n) time, in expectation.
For the jth bucket, with X_j entries, by Lemma 11.3.1, the construction succeeds with probability ≥ 1/2. As
before, the expected number of iterations till success is at most 2. As such, the expected construction time of
the secondary hash table for the jth bucket is O(X_j²).
We conclude that the overall expected construction time is O(n + Σ_j X_j²) = O(n).
As for the space used, observe that it is O(n + Σ_j X_j²) = O(n).
11.4. Bloom filters
Consider an application where we have a set S ⊆ U of n elements, and we want to be able to decide for a query
x ∈ U, whether or not x ∈ S. Naturally, we can use hashing. However, here we are interested in a data-structure
that is more efficient as far as space is concerned. We allow the data-structure to make a mistake (i.e., say that
an element is in, when it is not in).

First try. So, let us start with a silly scheme. Let B[0 . . . m − 1] be an array of m bits, and pick a random hash
function h : U → ℤ_m. Initialize B to 0. Next, for every element s ∈ S, set B[h(s)] to 1. Now, given a query x,
return B[h(x)] as the answer to whether or not x ∈ S. Note that B is an array of bits, and as such it can be
bit-packed and stored efficiently.
For the sake of simplicity of exposition, assume that the hash function picked is truly random. As such,
we have that the probability of a false positive (i.e., a mistake) for a fixed x ∈ U is at most n/m. Since we want
the size of the table m to be close to n, this is not satisfying.

Using k hash functions. Instead of using a single hash function, let us use k independent hash functions
h_1, . . . , h_k. For an element s ∈ S, we set B[h_i(s)] to 1, for i = 1, . . . , k. Given a query x ∈ U, if B[h_i(x)] is zero,
for any i = 1, . . . , k, then x ∉ S. Otherwise, if all these k bits are on, the data-structure returns that x is in S.
Clearly, if the data-structure returns that x is not in S, then it is correct. The data-structure might make
a mistake (i.e., a false positive), if it returns that x is in S (when it is not in S).
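The scheme above fits in a small class. The k hash functions are simulated here by salting a cryptographic hash, which is an illustrative stand-in for the k truly independent hash functions assumed by the analysis (real implementations use much cheaper functions):

```python
import hashlib

class BloomFilter:
    # A sketch of the k-hash-function scheme: m bits, k probes per item.
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)   # bit-packed array B

    def _indices(self, item):
        # Simulate k independent hash functions by salting sha256.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item):
        # Returns True only if all k probed bits are set; hence there
        # are never false negatives, only (rare) false positives.
        return all((self.bits[idx // 8] >> (idx % 8)) & 1
                   for idx in self._indices(item))
```

Note that only the bit array is stored; the elements themselves are discarded, which is exactly where the space savings come from.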
We interpret the storing of the elements of S in B as an experiment of throwing kn balls into m bins. The
probability of a bin to be empty is

p = p(m, n) = (1 − 1/m)^{kn} ≈ exp(−kn/m).

Since the number of empty bins is strongly concentrated around its expectation pm (this can be shown via a
martingale argument), we can treat p as the true probability of a bin to be empty.
The probability of a mistake is

f(k, m, n) = (1 − p)^k.

In particular, for k = (m/n) ln 2, we have that p = p(m, n) ≈ 1/2, and f(k, m, n) ≈ (1/2)^{(m/n) ln 2} ≈ 0.6185^{m/n}.

Example 11.4.1. Of course, the above is fictional, as k has to be an integer. But motivated by these calculations,
let m = 3n, and k = 4. We get that p(m, n) = exp(−4/3) ≈ 0.2636, and f(4, 3n, n) ≈ (1 − 0.2636)^4 ≈ 0.2941. This
is better than the naive k = 1 scheme, where the probability of a false positive is n/(3n) = 1/3.

Note that this scheme gets exponentially better than the naive scheme as m/n grows.

Example 11.4.2. Consider the setting m = 8n – this is when we allocate a byte for each element stored (the
element of course might be significantly bigger). The above implies we should take k = ⌈(m/n) ln 2⌉ = 6. We
then get p(8n, n) = exp(−6/8) ≈ 0.4724, and f(6, 8n, n) ≈ (1 − 0.4724)^6 ≈ 0.0216. Here, the naive scheme with
k = 1 would give a probability of false positive of 1/8 = 0.125. So this is a significant improvement.
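The numbers in the two examples are easy to reproduce; the following is a minimal sketch of the calculations, idealized by assuming truly random hash functions as in the text:

```python
import math

def p_empty(m, n, k):
    # Probability a fixed bit is still 0 after inserting n elements
    # with k (idealized, truly random) hash functions.
    return (1 - 1 / m) ** (k * n)

def false_positive(m, n, k):
    # Probability that all k bits probed for a non-member are 1.
    return (1 - p_empty(m, n, k)) ** k

n = 10**6
print(p_empty(3 * n, n, 4))        # close to exp(-4/3), about 0.2636
print(false_positive(3 * n, n, 4)) # about 0.294 (Example 11.4.1)
print(false_positive(8 * n, n, 6)) # about 0.0216 (Example 11.4.2)
```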

Remark 11.4.3. It is important to remember that Bloom filters are competing with direct hashing of the whole
elements. Even if one allocates 8 bits per item, as in the example above, the space used is significantly smaller
than regular hashing. A situation where such a Bloom filter makes sense is for a cache – we might want to
decide if an element is in a slow external cache (say, an SSD drive). Retrieving an item from the cache is slow, but
not so slow that we are unwilling to pay a small overhead because of false positives.

11.5. Bibliographical notes


Practical Issues. Hashing is typically used for integers, vectors, strings, etc.

• Universal hashing is defined for integers. To implement it for other objects, one needs to map objects in
some fashion to integers.

• Practical methods for various important cases such as vectors and strings are studied extensively. See
http://en.wikipedia.org/wiki/Universal_hashing for some pointers.

• A recent important paper bridging theory and practice of hashing is “The power of simple tabulation
hashing” by Mikkel Thorup and Mihai Patrascu, 2011. See http://en.wikipedia.org/wiki/Tabulation_hashing

Chapter 12

Min Cut

To acknowledge the corn - This purely American expression means to admit the losing of an argument, especially in
regard to a detail; to retract; to admit defeat. It is over a hundred years old. Andrew Stewart, a member of Congress, is said
to have mentioned it in a speech in 1828. He said that haystacks and cornfields were sent by Indiana, Ohio and Kentucky to
Philadelphia and New York. Charles A. Wickliffe, a member from Kentucky questioned the statement by commenting that
haystacks and cornfields could not walk. Stewart then pointed out that he did not mean literal haystacks and cornfields, but
the horses, mules, and hogs for which the hay and corn were raised. Wickliffe then rose to his feet, and said, “Mr. Speaker,
I acknowledge the corn”.

Funk, Earle, A Hog on Ice and Other Curious Expressions

12.1. Branching processes – Galton-Watson Process


12.1.1. The problem
In the 19th century, Victorians were worried that aristocratic surnames were disappearing, as family names
passed on only through the male children. As such, a family with no male children had its family name disappear.
So, imagine the number of male children of a person is an independent random variable X ∈ {0, 1, 2, . . .}. Starting
with a single person, its family (as far as male children are concerned) is a random tree with the degree of a
node being distributed according to X. We continue recursively in constructing this tree, again, sampling the
number of children for each current leaf according to the distribution of X. It is not hard to see that a family
disappears if E[X] ≤ 1, and it has a constant probability of surviving if E[X] > 1.
Francis Galton asked the question of what is the probability of such a blue-blood family name to survive,
and this question was answered by Henry William Watson [WG75]. The Victorians were worried about strange
things, see [Gre69] for a provocatively titled article from the period, and [Ste12] for a more recent take on this
issue.
Of course, since infant mortality is dramatically down (as is the number of aristocratic males dying to maintain
the British empire), the probability of a family name disappearing is now much lower than it was in the 19th
century (not to mention that many women keep their original family name). Interestingly, countries where family
names were introduced a long time ago have very few surnames (e.g., Koreans have about 250 surnames, and three
surnames account for 45% of the population). On the other hand, countries that introduced surnames more recently
have dramatically more surnames (for example, the Dutch have had surnames only for the last 200 years, and there
are 68,000 different family names).
Here we are going to look at a very specific variant of this problem. Imagine that we start with a single male.
A male has exactly two children, and each one of them is a male with probability half. The natural
question is what is the probability that, h generations down, there is a male descendant all of whose ancestors
are male.

12.1.2. On coloring trees


Let T_h be a complete binary tree of height h. We randomly color its edges by black and white. Namely, for each
edge we independently choose its color to be either black or white, with equal probability (say, black indicates
the child is male). We are interested in the event that there exists a path from the root of T_h to one of its leaves
that is all black. Let E_h denote this event, and let ρ_h = P[E_h]. Observe that ρ_0 = 1 and ρ_1 = 3/4 (see below).
To bound this probability, consider the root u of T_h and its two children u_l and u_r. The probability that
there is a black path from u_l to one of the leaves of its subtree is ρ_{h−1}, and as such, the probability that there is
a black path from u through u_l to a leaf of the subtree of u_l is P[the edge uu_l is colored black] · ρ_{h−1} = ρ_{h−1}/2.
As such, the probability that there is no black path through u_l is 1 − ρ_{h−1}/2, and the probability of not having a
black path from u to a leaf (through either child) is (1 − ρ_{h−1}/2)². In particular, the desired probability is the
complement; that is

ρ_h = 1 − (1 − ρ_{h−1}/2)² = (ρ_{h−1}/2)(2 − ρ_{h−1}/2) = ρ_{h−1} − ρ²_{h−1}/4 = f(ρ_{h−1}), for f(x) = x − x²/4.

The starting values are ρ_0 = 1, and ρ_1 = 3/4. Formally, we have the sequence:

ρ_0 = 1, ρ_1 = 3/4, ρ_h = ρ_{h−1} − ρ²_{h−1}/4.
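The recurrence is easy to iterate numerically; the following sketch checks the lower bound of Lemma 12.1.1 against the iterated values and illustrates the Θ(1/h) behavior (in fact ρ_h behaves roughly like 4/h):

```python
def rho(h):
    # Iterate rho_h = rho_{h-1} - rho_{h-1}^2 / 4, starting from rho_0 = 1.
    r = 1.0
    for _ in range(h):
        r -= r * r / 4
    return r

for h in (1, 10, 100, 1000):
    # Lemma 12.1.1: rho_h >= 1/(h+1); the last column shows h * rho_h.
    print(h, rho(h), h * rho(h))
```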

Figure 12.1: A graph of the function f(x) = x − x²/4.

Lemma 12.1.1. We have that ρ_h ≥ 1/(h + 1).

Proof: The proof is by induction. For h = 1, we have ρ_1 = 3/4 ≥ 1/(1 + 1).
Observe that ρ_h = f(ρ_{h−1}) for f(x) = x − x²/4, and f′(x) = 1 − x/2. As such, f′(x) > 0 for x ∈ [0, 1] and f(x)
is increasing in the range [0, 1]. As such, by induction, we have that

ρ_h = f(ρ_{h−1}) ≥ f( 1/((h − 1) + 1) ) = 1/h − 1/(4h²).

We need to prove that ρ_h ≥ 1/(h + 1), which is implied by the above if

1/h − 1/(4h²) ≥ 1/(h + 1) ⇔ 4h(h + 1) − (h + 1) ≥ 4h² ⇔ 4h² + 4h − h − 1 ≥ 4h² ⇔ 3h ≥ 1,

which trivially holds.

One can also prove an upper bound on this probability, showing that ρ_h = Θ(1/h). We provide the proof
here for the sake of completeness, but the reader is encouraged to skip reading its proof, as we do not need this
result.

Lemma 12.1.2. We have that ρ_h = O(1/h).

Proof: The claim trivially holds for small values of h. For any j > 0, let h_j be the minimal index such that
ρ_{h_j} ≤ 1/2^j. It is easy to verify that ρ_{h_j} ≥ 1/2^{j+1}. We claim (mysteriously) that

h_{j+1} − h_j ≤ (ρ_{h_j} − ρ_{h_{j+1}}) / ( (ρ_{h_{j+1}})²/4 ).

Indeed, ρ_{k+1} is the number resulting from removing ρ_k²/4 from ρ_k. Namely, the sequence ρ_1, ρ_2, . . . is a monotonically
decreasing sequence of numbers in the interval [0, 1], where the gaps between consecutive numbers decrease.
In particular, to get from ρ_{h_j} to ρ_{h_{j+1}}, the gaps used were of size at least ∆ = (ρ_{h_{j+1}})²/4, which means that there
are at least (ρ_{h_j} − ρ_{h_{j+1}})/∆ − 1 numbers in the series between these two elements. As such, we have

h_{j+1} − h_j ≤ (ρ_{h_j} − ρ_{h_{j+1}}) / ( (ρ_{h_{j+1}})²/4 ) ≤ (1/2^j − 1/2^{j+2}) / (1/2^{2(j+2)+2}) = 2^{j+6} − 2^{j+4} = O(2^j).

Arguing similarly, we have

h_{j+2} − h_j ≥ (ρ_{h_j} − ρ_{h_{j+2}}) / ( (ρ_{h_j})²/4 ) ≥ (1/2^{j+1} − 1/2^{j+2}) / (1/2^{2j+2}) = 2^{j+1} − 2^j = Ω(2^j).

We conclude that h_j = (h_j − h_{j−2}) + (h_{j−2} − h_{j−4}) + · · · = Ω(2^j), implying the claim.

12.2. Min Cut


12.2.1. Problem Definition
Let G = (V, E) be an undirected graph with n vertices and m edges. We are interested in cuts in G.
Definition 12.2.1. A cut in G is a partition of the vertices of V into two sets S and V \ S, where S ≠ ∅ and
V \ S ≠ ∅. The edges of the cut are

(S, V \ S) = { uv ∈ E | u ∈ S, v ∈ V \ S }.

The number of edges in the cut (S, V \ S) is the size of the cut. For an example of a cut, see the figure on
the right.

We are interested in the problem of computing the minimum cut (i.e., mincut), that is, the cut in the
graph with minimum cardinality. Specifically, we would like to find the set S ⊆ V such that |(S, V \ S)| is as small
as possible, and S is neither empty nor V \ S is empty.

12.2.2. Some Definitions


We remind the reader of the following concepts. The conditional probability of X given Y is

P[X = x | Y = y] = P[(X = x) ∩ (Y = y)] / P[Y = y].

An equivalent, useful restatement of this is that

P[(X = x) ∩ (Y = y)] = P[X = x | Y = y] · P[Y = y].     (12.1)

The following is easy to prove by induction using Eq. (12.1).

Lemma 12.2.2. Let E_1, . . . , E_n be n events which are not necessarily independent. Then,

P[ ∩_{i=1}^n E_i ] = P[E_1] · P[E_2 | E_1] · P[E_3 | E_1 ∩ E_2] · . . . · P[E_n | E_1 ∩ . . . ∩ E_{n−1}].

12.3. The Algorithm


The basic operation used by the algorithm is edge contraction, depicted in Figure 12.2. We take an edge
e = xy in G and merge the two vertices into a single vertex. The new resulting graph is denoted by G/xy.
Note that we remove self loops created by the contraction. However, since the resulting graph is no longer
necessarily a simple graph, it has parallel edges – namely, it is a multi-graph. We represent a multi-graph as a
simple graph with multiplicities on the edges. See Figure 12.3.

Figure 12.2: (a) A contraction of the edge xy. (b) The resulting graph.

Figure 12.4: (a) Original graph. (b)–(j) a sequence of contractions in the graph, and (h) the cut in the original
graph, corresponding to the single edge in (h). Note that the cut of (h) is not a mincut in the original graph.

The edge contraction operation can be implemented in O(n) time for a graph with n vertices. This is done
by merging the adjacency lists of the two vertices being contracted, and then using hashing to do the fix-ups
(i.e., we need to fix the adjacency lists of the vertices that are connected to the two vertices).
Note that the cut is now computed counting multiplicities (i.e., if e is in the cut and it has weight w, then
the contribution of e to the cut weight is w).

Figure 12.3: On the left a multi-graph, and on the right a minimum cut in the resulting multi-graph.
Observation 12.3.1. A set of vertices in G/xy corresponds to a set of vertices in the graph G. Thus a cut
in G/xy always corresponds to a valid cut in G, of the same cardinality. However, there are cuts in G that do
not exist in G/xy. For example, the cut S = {x} does not exist in G/xy. As such, the size of the minimum cut
in G/xy is at least as large as the minimum cut in G (as long as G/xy has at least one edge), since any cut in
G/xy has a corresponding cut of the same cardinality in G.

Our algorithm works by repeatedly performing edge contractions. This is beneficial as this shrinks the
underlying graph, and we would compute the cut in the resulting (smaller) graph. An “extreme” example of
this, is shown in Figure 12.4, where we contract the graph into a single edge, which (in turn) corresponds to
a cut in the original graph. (It might help the reader to think about each vertex in the contracted graph, as
corresponding to a connected component in the original graph.)
Figure 12.4 also demonstrates the problem with taking this approach. Indeed, the resulting cut is not the
minimum cut in the graph.
So, why did the algorithm fail to find the minimum cut in this case?¬ The failure occurs because of the
contraction at Figure 12.4 (e), as we had contracted an edge in the minimum cut. In the new graph, depicted
¬ Naturally, if the algorithm had succeeded in finding the minimum cut, this would have been our success.

Algorithm MinCut(G)
G0 ← G
i=0
while Gi has more than two vertices do
Pick randomly an edge ei from the edges of Gi
Gi+1 ← Gi /ei
i ←i+1
Let (S, V \ S) be the cut in the original graph
corresponding to the single edge in Gi
return (S, V \ S).

Figure 12.5: The minimum cut algorithm.

in Figure 12.4 (f), there is no longer a cut of size 3, and all cuts are of size 4 or more. Specifically, the algorithm
succeeds only if it does not contract an edge in the minimum cut.

Observation 12.3.2. Let e_1, . . . , e_{n−2} be a sequence of edges in G, such that none of them is in the minimum
cut, and such that G′ = G/{e_1, . . . , e_{n−2}} is a single multi-edge. Then, this multi-edge corresponds to a minimum
cut in G.

Note that the claim in the above observation is only in one direction. We might still be able to compute
a minimum cut, even if we contract an edge in a minimum cut; the reason being that a minimum cut is not
unique. In particular, another minimum cut might survive the sequence of contractions that destroyed other
minimum cuts.
Using Observation 12.3.2 in an algorithm is problematic, since the argumentation is circular, how can we
find a sequence of edges that are not in the cut without knowing what the cut is? The way to slice the Gordian
knot here, is to randomly select an edge at each stage, and contract this random edge.
See Figure 12.5 for the resulting algorithm MinCut.
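A minimal implementation of MinCut is short. The sketch below represents the multi-graph as an edge list and uses a union-find structure to track which original vertices were merged – an implementation choice, not the adjacency-list scheme described above – and it assumes the input graph is connected:

```python
import random

def karger_min_cut(edges, n):
    # One run of MinCut on a connected multigraph with vertices 0..n-1
    # given as an edge list: repeatedly contract a uniformly random
    # surviving edge until two super-vertices remain; return the number
    # of edges crossing the resulting cut.
    parent = list(range(n))

    def find(x):  # union-find representative of x's super-vertex
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    vertices, live = n, list(edges)
    while vertices > 2:
        u, v = random.choice(live)        # uniform over surviving edges
        parent[find(u)] = find(v)         # contract the edge uv
        vertices -= 1
        live = [e for e in live if find(e[0]) != find(e[1])]  # drop self loops
    return len(live)

def min_cut_rep(edges, n, reps):
    # Amplification: return the best cut found over independent runs.
    return min(karger_min_cut(edges, n) for _ in range(reps))
```

Note that a single run always outputs a valid cut (never smaller than the mincut), so taking the minimum over repetitions can only help.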

12.3.1. Analysis
12.3.1.1. The probability of success
Naturally, if we are extremely lucky, the algorithm would never pick an edge in the mincut, and the algorithm
would succeed. The ultimate question here is what is the probability of success. If it is relatively “large” then
this algorithm is useful since we can run it several times, and return the best result computed. If on the other
hand, this probability is tiny, then we are working in vain since this approach would not work.

Lemma 12.3.3. If a graph G has a minimum cut of size k and G has n vertices, then |E(G)| ≥ kn/2.

Proof: Each vertex degree is at least k, since otherwise the cut ({v}, V \ {v}) around a vertex v of smaller degree
would be a cut of size smaller than k. As such, there are at least Σ_{v∈V} degree(v)/2 ≥ nk/2 edges in the graph.

Lemma 12.3.4. If we pick a random edge e from a graph G, then with probability at most 2/n it belongs to
the minimum cut.

Proof: There are at least nk/2 edges in the graph and exactly k edges in the minimum cut. Thus, the probability
of picking an edge from the minimum cut is at most k/(nk/2) = 2/n.

The following lemma shows (surprisingly) that MinCut succeeds with reasonable probability.
Lemma 12.3.5. MinCut outputs the mincut with probability ≥ 2/(n(n − 1)).
Proof: Let Ei be the event that ei is not in the minimum cut of Gi . By Observation 12.3.2, MinCut outputs
the minimum cut if the events E0, . . . , En−3 all happen (namely, all edges picked are outside the minimum cut).
By Lemma 12.3.4, it holds that P[E_i | E_0 ∩ E_1 ∩ . . . ∩ E_{i−1}] ≥ 1 − 2/|V(G_i)| = 1 − 2/(n − i). Implying that

∆ = P[E_0 ∩ . . . ∩ E_{n−3}] = P[E_0] · P[E_1 | E_0] · P[E_2 | E_0 ∩ E_1] · . . . · P[E_{n−3} | E_0 ∩ . . . ∩ E_{n−4}].

As such, we have

∆ ≥ ∏_{i=0}^{n−3} (1 − 2/(n − i)) = ∏_{i=0}^{n−3} (n − i − 2)/(n − i) = (n−2)/n · (n−3)/(n−1) · (n−4)/(n−2) · . . . · 2/4 · 1/3 = 2/(n(n − 1)).

12.3.1.2. Running time analysis.


Observation 12.3.6. MinCut runs in O(n2 ) time.
Observation 12.3.7. The algorithm always outputs a cut, and the cut is not smaller than the minimum cut.
Informally, amplification is the process of running an experiment again and again till the things we want
to happen, with good probability, do happen.
Let MinCutRep be the algorithm that runs MinCut n(n − 1) times and return the minimum cut computed
in all those independent executions of MinCut.
Lemma 12.3.8. The probability that MinCutRep fails to return the minimum cut is < 0.14.
Proof: The probability that MinCut fails to output the mincut in each execution is at most 1 − 2/(n(n − 1)), by
Lemma 12.3.5. Now, MinCutRep fails only if all the n(n − 1) executions of MinCut fail. But these executions
are independent, as such, the probability of this happening is at most

( 1 − 2/(n(n − 1)) )^{n(n−1)} ≤ exp( −(2/(n(n − 1))) · n(n − 1) ) = exp(−2) < 0.14,

since 1 − x ≤ e^{−x} for 0 ≤ x ≤ 1.
Theorem 12.3.9. One can compute the minimum cut in O(n⁴) time with constant probability of getting a correct
result. In O(n⁴ log n) time the minimum cut is returned with high probability.


12.4. A faster algorithm


The algorithm presented in the previous section is extremely simple. This raises the question of whether we
can get a faster algorithm­?
So, why does MinCutRep need so many executions? Well, the probability of success in the first ν iterations is

P[E_0 ∩ . . . ∩ E_{ν−1}] ≥ ∏_{i=0}^{ν−1} (1 − 2/(n − i)) = ∏_{i=0}^{ν−1} (n − i − 2)/(n − i) = (n−2)/n · (n−3)/(n−1) · (n−4)/(n−2) · . . . = (n − ν)(n − ν − 1)/(n(n − 1)).     (12.2)
Namely, this probability deteriorates very quickly toward the end of the execution, when the graph becomes
small enough. (To see this, observe that for ν = n/2, the probability of success is roughly 1/4, but for ν = n − √n
the probability of success is roughly 1/n.)
So, the key observation is that as the graph get smaller the probability to make a bad choice increases. So,
instead of doing the amplification from the outside of the algorithm, we will run the new algorithm more times
when the graph is smaller. Namely, we put the amplification directly into the algorithm.
The basic new operation we use is Contract, depicted in Figure 12.6, which also depict the new algorithm
FastCut.
­ This would require a more involved algorithm, that's life.

Contract(G, t)
begin
    while |V(G)| > t do
        Pick a random edge e in G.
        G ← G/e
    return G
end

FastCut(G = (V, E))
G – multi-graph
begin
    n ← |V(G)|
    if n ≤ 6 then
        Compute (via brute force) minimum cut of G and return cut.
    t ← ⌈1 + n/√2⌉
    H_1 ← Contract(G, t)
    H_2 ← Contract(G, t)
    /* Contract is randomized!!! */
    X_1 ← FastCut(H_1), X_2 ← FastCut(H_2)
    return minimum cut out of X_1 and X_2.
end

Figure 12.6: Contract(G, t) shrinks G till it has only t vertices. FastCut computes the minimum cut using
Contract.
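FastCut can be sketched along the same lines. The edge-list representation, the union-find bookkeeping, and the subset-enumeration base case below are implementation choices (the input is assumed connected):

```python
import math
import random

def contract(edges, n, t):
    # Contract(G, t): shrink a connected multigraph (edge list over
    # vertices 0..n-1) until t super-vertices remain, then relabel the
    # survivors as 0..t-1 and return the new edge list.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    vertices, live = n, list(edges)
    while vertices > t:
        u, v = random.choice(live)
        parent[find(u)] = find(v)
        vertices -= 1
        live = [e for e in live if find(e[0]) != find(e[1])]
    labels, out = {}, []
    for u, v in live:
        ru, rv = find(u), find(v)
        for r in (ru, rv):
            labels.setdefault(r, len(labels))
        out.append((labels[ru], labels[rv]))
    return out, vertices

def brute_min_cut(edges, n):
    # Base case: minimum cut by enumerating all nonempty proper subsets.
    best = len(edges)
    for mask in range(1, 2**n - 1):
        cut = sum(((mask >> u) & 1) != ((mask >> v) & 1) for u, v in edges)
        best = min(best, cut)
    return best

def fast_cut(edges, n):
    if n <= 6:
        return brute_min_cut(edges, n)
    t = math.ceil(1 + n / math.sqrt(2))
    e1, n1 = contract(edges, n, t)
    e2, n2 = contract(edges, n, t)   # two independent contractions
    return min(fast_cut(e1, n1), fast_cut(e2, n2))
```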

Lemma 12.4.1. The running time of FastCut(G) is O(n² log n), where n = |V(G)|.

Proof: Well, we perform two calls to Contract(G, t), which takes O(n²) time. And then we perform two recursive
calls on the resulting graphs. We have

T(n) = O(n²) + 2T(n/√2).

The solution to this recurrence is O(n² log n), as one can easily (and should) verify.

Exercise 12.4.2. Show that one can modify FastCut so that it uses only O(n²) space.
Lemma 12.4.3. The probability that Contract(G, n/√2) did not contract the minimum cut is at least 1/2.
Namely, the probability that the minimum cut in the contracted graph is still a minimum cut in the original
graph is at least 1/2.

Proof: Just plug ν = n − t = n − ⌈1 + n/√2⌉ into Eq. (12.2). We have

P[E_0 ∩ . . . ∩ E_{ν−1}] ≥ t(t − 1)/(n(n − 1)) = ⌈1 + n/√2⌉ ( ⌈1 + n/√2⌉ − 1 ) / (n(n − 1)) ≥ 1/2.

The following lemma bounds the probability of success.

Lemma 12.4.4. FastCut finds the minimum cut with probability Ω(1/log n).

Proof: Let T_h be the recursion tree of the algorithm, of depth h = Θ(log n). Color an edge of the recursion tree
black if the corresponding call to Contract succeeded (i.e., did not contract an edge of the minimum cut).
Clearly, the algorithm succeeds if there is a path from the root to a leaf that is all black. This is exactly the
setting of Lemma 12.1.1, and we conclude that the probability of success is at least 1/(h + 1) = Θ(1/log n), as
desired.

Exercise 12.4.5. Prove that running FastCut repeatedly c · log² n times guarantees that the algorithm outputs
the minimum cut with probability ≥ 1 − 1/n², say, for c a large enough constant.

Theorem 12.4.6. One can compute the minimum cut in a graph G with n vertices in O(n² log³ n) time. The
algorithm succeeds with probability ≥ 1 − 1/n².

Proof: We do amplification on FastCut by running it O(log² n) times. The running time bound follows from
Lemma 12.4.1. The bound on the probability follows from Lemma 12.4.4, and using the amplification analysis
as done in Lemma 12.3.8 for MinCutRep.

12.5. Bibliographical Notes


The MinCut algorithm was developed by David Karger during his PhD thesis in Stanford. The fast algorithm
is a joint work with Clifford Stein. The basic algorithm of the mincut is described in [MR95, pages 7–9], the
faster algorithm is described in [MR95, pages 289–295].

Galton-Watson process. The idea of using coloring of the edges of a tree to analyze FastCut might be new
(i.e., Section 12.1.2).

Part V
Network flow

Chapter 13

Network Flow

13.1. Network Flow


We would like to transfer as much “merchandise” as possible from one point to another. For example, we have
a wireless network, and one would like to transfer a large file from s to t. The network has limited capacity,
and one would like to compute the maximum amount of information one can transfer.
Specifically, there is a network with capacities associated with each connection in the network. The question
is how much “flow” can you transfer from a source s into a sink t. Note that here we think about the flow as
being splittable, so that it can travel from the source to the sink along several parallel paths simultaneously.
So, think about our network as being a network of pipes moving water from the source to the sink (the capacities
are how much water a pipe can transfer in a given unit of time). On the other hand, in the internet, traffic is
packet based and splitting is less easy to do.

Figure 13.1: A network flow.

Definition 13.1.1. Let G = (V, E) be a directed graph. For every edge (u, v) ∈ E(G) we have an associated edge
capacity c(u, v), which is a non-negative number. If the edge (u, v) ∉ E(G) then c(u, v) = 0. In addition, there is a
source vertex s and a target sink vertex t.
The entities G, s, t and c(·) together form a flow network or simply a network. An example of such a
flow network is depicted in Figure 13.1.

We would like to transfer as much flow from the source s to the sink t as possible. Specifically, all the flow
starts from the source vertex, and ends up in the sink. The flow on an edge is a non-negative quantity that can
not exceed the capacity constraint for this edge. One possible flow is depicted in the figure, where the numbers
a/b on an edge denote a flow of a units on an edge with capacity at most b.
We next formalize our notion of a flow.
Definition 13.1.2 (flow). A flow in a network is a function f(·, ·) on the edges of G such that:
(A) Bounded by capacity: For any edge (u, v) ∈ E, we have f(u, v) ≤ c(u, v).
Specifically, the amount of flow between u and v on the edge (u, v) never exceeds its capacity c(u, v).
(B) Anti symmetry: For any u, v we have f(u, v) = −f(v, u).
(C) There are two special vertices: (i) the source vertex s (all flow starts from the source), and the sink
vertex t (all the flow ends in the sink).
(D) Conservation of flow: For any vertex u ∈ V \ {s, t}, we have Σ_v f(u, v) = 0.¬ Namely, for any internal
node, all the flow that flows into a vertex leaves this vertex.

The amount of flow (or simply flow) of f, called the value of f, is |f| = Σ_{v∈V} f(s, v).
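The definition translates directly into a small validity checker; the dictionary representation of c and f below is an illustrative choice (missing pairs are treated as 0):

```python
def is_valid_flow(c, f, s, t):
    # Check the properties of Definition 13.1.2. Both c and f are dicts
    # mapping ordered pairs (u, v) to numbers; missing pairs mean 0.
    V = {u for e in list(c) + list(f) for u in e}
    g = lambda d, u, v: d.get((u, v), 0)
    for u in V:
        for v in V:
            if g(f, u, v) > g(c, u, v):       # (A) bounded by capacity
                return False
            if g(f, u, v) != -g(f, v, u):     # (B) anti-symmetry
                return False
    return all(sum(g(f, u, v) for v in V) == 0  # (D) conservation of flow
               for u in V - {s, t})

def flow_value(f, s, V):
    # |f| is the total flow leaving the source s.
    return sum(f.get((s, v), 0) for v in V)
```

For example, on the tiny network s → a → t with both capacities 3, pushing 2 units on each edge (together with the anti-symmetric entries) is a valid flow of value 2.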

Note, that a flow on an edge can be negative (i.e., there is a positive flow flowing on this edge in the other
direction).

Problem 13.1.3 (Maximum flow). Given a network G find the maximum flow in G. Namely, compute a legal
flow f such that | f | is maximized.

13.2. Some properties of flows and residual networks


For two sets X, Y ⊆ V, let f(X, Y) = Σ_{x∈X, y∈Y} f(x, y). We will slightly abuse the notation and refer to f({v}, S)
by f(v, S), where v ∈ V(G).

Observation 13.2.1. By definition, we have | f | = f (s, V).

Lemma 13.2.2. For a flow f , the following properties holds:


(i) ∀u ∈ V(G) we have f (u, u) = 0,
(ii) ∀X ⊆ V we have f (X, X) = 0,
(iii) ∀X,Y ⊆ V we have f (X,Y ) = − f (Y, X),
(iv) ∀X,Y, Z ⊆ V such that X∩Y = ∅ we have that f (X∪Y, Z) = f (X, Z)+ f (Y, Z) and f (Z, X∪Y ) = f (Z, X)+ f (Z,Y ).
(v) For all u ∈ V \ {s, t}, we have f (u, V) = f (V, u) = 0.

Proof: Property (i) holds since (u, u) is not an edge in the graph, and as such its flow is zero. As for property
(ii), we have

f(X, X) = Σ_{{u,v}⊆X, u≠v} ( f(u, v) + f(v, u) ) + Σ_{u∈X} f(u, u) = Σ_{{u,v}⊆X, u≠v} ( f(u, v) − f(u, v) ) + 0 = 0,

by the anti-symmetry property of flow (Definition 13.1.2 (B)).
Property (iii) holds immediately by the anti-symmetry of flow, as

f(X, Y) = Σ_{x∈X, y∈Y} f(x, y) = − Σ_{x∈X, y∈Y} f(y, x) = −f(Y, X).

(iv) This case follows immediately from the definition.
Finally, (v) is a restatement of the conservation of flow property.

Claim 13.2.3. | f | = f (V, t).


¬ This law for electric circuits is known as Kirchhoff’s Current Law.

Figure 13.2: (i) A flow network, and (ii) the resulting residual network. Note that f(u, w) = −f(w, u) = −1 and
as such c_f(u, w) = 10 − (−1) = 11.

Proof: We have:

|f| = f(s, V) = f( V \ (V \ {s}), V )
    = f(V, V) − f(V \ {s}, V)
    = −f(V \ {s}, V) = f(V, V \ {s})
    = f(V, t) + f(V, V \ {s, t})
    = f(V, t) + Σ_{u∈V\{s,t}} f(V, u)
    = f(V, t),

since f(V, V) = 0 by Lemma 13.2.2 (ii), and f(V, u) = 0 for every u ∈ V \ {s, t} by Lemma 13.2.2 (v).

Definition 13.2.4. Given capacity c and flow f , the residual capacity of an edge (u, v) is

c f (u, v) = c(u, v) − f (u, v).

Intuitively, the residual capacity c f (u, v) on an edge (u, v) is the amount of unused capacity on (u, v). We can
next construct a graph with all edges that are not being fully used by f , and as such can serve to improve f .

Definition 13.2.5. Given f, G = (V, E) and c, as above, the residual graph (or residual network) of G and f
is the graph G_f = (V, E_f), where

E_f = { (u, v) ∈ V × V | c_f(u, v) > 0 }.

Note, that by the definition of E f , it might be that an edge (u, v) that appears in E might induce two
edges in E f . Indeed, consider an edge (u, v) such that f (u, v) < c(u, v) and (v, u) is not an edge of G. Clearly,
c f (u, v) = c(u, v) − f (u, v) > 0 and (u, v) ∈ E f . Also,

c f (v, u) = c(v, u) − f (v, u) = 0 − (− f (u, v)) = f (u, v),

since c(v, u) = 0 as (v, u) is not an edge of G. As such, (v, u) ∈ E f . This states that we can always reduce the
flow on the edge (u, v) and this is interpreted as pushing flow on the edge (v, u). See Figure 13.2 for an example
of a residual network.
Since every edge of G induces at most two edges in G_f, it follows that G_f has at most twice the number of
edges of G; formally, |E_f| ≤ 2 |E|.
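Computing the residual network from c and f is a direct transcription of the two definitions above; the dictionary representation is an illustrative choice (missing pairs mean 0):

```python
def residual(c, f):
    # Build the residual capacities c_f(u, v) = c(u, v) - f(u, v) and the
    # residual edge set E_f = {(u, v) : c_f(u, v) > 0}. Both c and f are
    # dicts mapping ordered pairs to numbers; missing pairs mean 0.
    V = {u for e in list(c) + list(f) for u in e}
    cf, Ef = {}, []
    for u in V:
        for v in V:
            if u == v:
                continue
            r = c.get((u, v), 0) - f.get((u, v), 0)
            if r > 0:
                cf[(u, v)] = r
                Ef.append((u, v))
    return cf, Ef
```

Note how an edge (u, v) carrying positive flow yields a residual edge in the reverse direction, since c_f(v, u) = 0 − (−f(u, v)) = f(u, v) > 0, mirroring the computation in the caption of Figure 13.2.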

Lemma 13.2.6. Given a flow f defined over a network G, then the residual network G f together with c f form
a flow network.

Proof: One needs to verify that c_f(·) is always a non-negative function, which is true by the definition of E_f.
The following lemma testifies that we can improve a flow f on G by finding any legal flow h in the residual
network G f .

Lemma 13.2.7. Given a flow network G = (V, E), a flow f in G, and a flow h in G_f, where G_f is the residual network of f, the function f + h is a (legal) flow in G and its value is |f + h| = |f| + |h|.

Proof: By definition, we have (f + h)(u, v) = f(u, v) + h(u, v), and thus (f + h)(X, Y) = f(X, Y) + h(X, Y). We need to verify that f + h is a legal flow, by verifying the properties required of it by Definition 13.1.2.
Anti-symmetry holds since (f + h)(u, v) = f(u, v) + h(u, v) = −f(v, u) − h(v, u) = −(f + h)(v, u).
Next, we verify that the flow f + h is bounded by the capacities. Indeed,
$$(f + h)(u, v) = f(u, v) + h(u, v) \le f(u, v) + c_f(u, v) = f(u, v) + \bigl(c(u, v) - f(u, v)\bigr) = c(u, v).$$
For u ∈ V \ {s, t} we have (f + h)(u, V) = f(u, V) + h(u, V) = 0 + 0 = 0, and as such f + h complies with the conservation of flow requirement.
Finally, the total flow is

| f + h| = ( f + h)(s, V) = f (s, V) + h(s, V) = | f | + |h| .

Definition 13.2.8. For G and a flow f, a path π in G_f between s and t is an augmenting path.

Note that all the edges of π have positive residual capacity in G_f, since otherwise (by definition) they would not appear in E_f. As such, given a flow f and an augmenting path π, we can improve f by pushing a positive amount of flow along π. An augmenting path for the network flow of Figure 13.2 is depicted in Figure 13.3.

Definition 13.2.9. For an augmenting path π, let c_f(π) be the maximum amount of flow we can push through π. We call c_f(π) the residual capacity of π. Formally,
$$c_f(\pi) = \min_{(u,v) \in \pi} c_f(u, v).$$

Figure 13.3: An augmenting path for the flow of Figure 13.2.

We can now define a flow that realizes the flow along π. Indeed:
$$
f_\pi(u, v) =
\begin{cases}
c_f(\pi) & \text{if } (u, v) \text{ is in } \pi, \\
-c_f(\pi) & \text{if } (v, u) \text{ is in } \pi, \\
0 & \text{otherwise.}
\end{cases}
$$

Figure 13.4: The flow resulting from applying the residual flow f_π of the path π of Figure 13.3 to the flow of Figure 13.2.

Lemma 13.2.10. For an augmenting path π, the flow f_π is a flow in G_f and |f_π| = c_f(π) > 0.

We can now use such a path to get a larger flow.

Lemma 13.2.11. Let f be a flow, and let π be an augmenting path for f . Then f + fπ is a “better” flow.
Namely, | f + fπ | = | f | + | fπ | > | f |.

Namely, f + f_π is a flow with larger value than f. Consider the flow in Figure 13.4. Can we continue improving it? Well, if you inspect the residual network of this flow, depicted on the right, you will observe that s is disconnected from t in this residual network. So, we are unable to push any more flow. Namely, we found a solution which is a local maximum solution for network flow. But is that a global maximum? Is this the maximum flow we are looking for?

13.3. The Ford-Fulkerson method

Given a network G with capacity constraints c, the above discussion suggests a simple and natural method to compute a maximum flow. This is known as the Ford-Fulkerson method for computing maximum flow; we will refer to it as the mtdFordFulkerson method:

mtdFordFulkerson(G, c)
begin
    f ← Zero flow on G
    while G_f has an augmenting path π do
        (* Recompute G_f for this check *)
        f ← f + f_π
    return f
end

It is unclear that this method terminates and reaches the global maximum flow (this is the reason we do not refer to it as an algorithm). We address these problems shortly.
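For concreteness, the method can be sketched in Python as follows. The adjacency-matrix representation, the DFS search for an augmenting path, and the example network (a classical textbook instance whose maximum flow value is 23) are illustrative choices for this sketch:

```python
# Sketch of mtdFordFulkerson for integer capacities. Any augmenting path
# will do; here a DFS in the (implicit) residual graph finds one.

def ford_fulkerson(n, cap, s, t):
    """cap[u][v] = capacity of edge (u, v); returns the maximum flow value."""
    flow = [[0] * n for _ in range(n)]

    def find_path():
        # DFS for an s-t path using only edges with positive residual capacity.
        parent = [None] * n
        stack, seen = [s], {s}
        while stack:
            u = stack.pop()
            for v in range(n):
                if v not in seen and cap[u][v] - flow[u][v] > 0:
                    parent[v] = u
                    if v == t:
                        return parent
                    seen.add(v)
                    stack.append(v)
        return None

    value = 0
    while (parent := find_path()) is not None:
        # Residual capacity c_f(pi) of the augmenting path.
        b, v = float('inf'), t
        while v != s:
            u = parent[v]
            b = min(b, cap[u][v] - flow[u][v])
            v = u
        # Push b units along the path, and -b on the reverse edges.
        v = t
        while v != s:
            u = parent[v]
            flow[u][v] += b
            flow[v][u] -= b
            v = u
        value += b
    return value

# Example with vertices s=0, v1=1, v2=2, v3=3, v4=4, t=5; max flow is 23.
cap = [[0, 16, 13, 0, 0, 0],
       [0, 0, 10, 12, 0, 0],
       [0, 4, 0, 0, 14, 0],
       [0, 0, 9, 0, 0, 20],
       [0, 0, 0, 7, 0, 4],
       [0, 0, 0, 0, 0, 0]]
mf = ford_fulkerson(6, cap, 0, 5)
print(mf)  # 23
```

With integer capacities the flow value increases by at least one per iteration, so the sketch terminates; the analysis below makes this precise.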

13.4. On maximum flows


We need several natural concepts.

Definition 13.4.1. A directed cut (S, T) in a flow network G = (V, E) is a partition of V into S and T = V \ S, such that s ∈ S and t ∈ T. We will usually refer to a directed cut simply as a cut.
The net flow of f across a cut (S, T) is f(S, T) = Σ_{u ∈ S, v ∈ T} f(u, v).
The capacity of (S, T) is c(S, T) = Σ_{u ∈ S, v ∈ T} c(u, v).
The minimum cut is the cut in G with the minimum capacity.

Lemma 13.4.2. Let G, f, s, t be as above, and let (S, T) be a cut of G. Then f(S, T) = |f|.

Proof: We have
$$f(S, T) = f(S, V) - f(S, S) = f(S, V) = f(s, V) + f(S \setminus \{s\}, V) = f(s, V) = |f|,$$
since T = V \ S, and $f(S \setminus \{s\}, V) = \sum_{u \in S \setminus \{s\}} f(u, V) = 0$ by Lemma 13.2.2 (v) (note that u cannot be t, as t ∈ T).

Claim 13.4.3. The flow in a network is upper bounded by the capacity of any cut (S, T) in G.

Proof: Consider a cut (S, T). We have
$$|f| = f(S, T) = \sum_{u \in S, v \in T} f(u, v) \le \sum_{u \in S, v \in T} c(u, v) = c(S, T).$$

In particular, the maximum flow is bounded by the capacity of the minimum cut. Surprisingly, the maximum
flow is exactly the value of the minimum cut.

Theorem 13.4.4 (Max-flow min-cut theorem). If f is a flow in a flow network G = (V, E) with source s
and sink t, then the following conditions are equivalent:
(A) f is a maximum flow in G.
(B) The residual network G f contains no augmenting paths.

(C) |f| = c(S, T) for some cut (S, T) of G, and (S, T) is a minimum cut in G.

Proof: (A) ⇒ (B): By contradiction. If there was an augmenting path π, then c_f(π) > 0, and we could generate a new flow f + f_π with |f + f_π| = |f| + c_f(π) > |f|. A contradiction, as f is a maximum flow.
(B) ⇒ (C): It must be that s and t are disconnected in G_f. Let
$$S = \bigl\{ v \;\bigm|\; \text{there exists a path between } s \text{ and } v \text{ in } G_f \bigr\}$$
and T = V \ S. We have that s ∈ S, t ∈ T, and for any u ∈ S and v ∈ T we have f(u, v) = c(u, v). Indeed, if there were u ∈ S and v ∈ T such that f(u, v) < c(u, v), then (u, v) ∈ E_f, and v would be reachable from s in G_f, contradicting the construction of T.
This implies that |f| = f(S, T) = c(S, T). The cut (S, T) must be a minimum cut, because otherwise there would be a cut (S′, T′) with smaller capacity c(S′, T′) < c(S, T) = f(S, T) = |f|. On the other hand, by Claim 13.4.3, we have |f| = f(S′, T′) ≤ c(S′, T′). A contradiction.
(C) ⇒ (A): For any cut (U, W), we know that |f| ≤ c(U, W). This implies that if |f| = c(S, T), then the flow cannot be any larger, and it is thus a maximum flow.

The above max-flow min-cut theorem implies that if mtdFordFulkerson terminates, then it has computed the maximum flow. What is still elusive is showing that the mtdFordFulkerson method always terminates. This turns out to be true only if we are careful about the way we pick the augmenting path.

Chapter 14

Network Flow II - The Vengeance

14.1. Accountability

The comic in Figure 14.1 is by Jonathan Shewchuk and refers to the Calvin and Hobbes comics.

People that do not know maximum flows: essentially everybody.
Average salary on earth: < $5,000.
People that know maximum flow: most of them work in programming related jobs and make at least $10,000 a year.
Salary of people that learned maximum flows: > $10,000.
Salary of people that did not learn maximum flows: < $5,000.
Salary of people that know Latin: 0 (unemployed).

Figure 14.1: http://www.cs.berkeley.edu/~jrs/

Thus, by just learning maximum flows (and not knowing Latin) you can double your future salary!

14.2. The Ford-Fulkerson Method

The mtdFordFulkerson method is depicted below.

mtdFordFulkerson(G, s, t)
    Initialize flow f to zero
    while ∃ path π from s to t in G_f do
        c_f(π) ← min{ c_f(u, v) | (u, v) ∈ π }
        for ∀(u, v) ∈ π do
            f(u, v) ← f(u, v) + c_f(π)
            f(v, u) ← f(v, u) − c_f(π)

Lemma 14.2.1. If the capacities on the edges of G are integers, then mtdFordFulkerson runs in O(m |f*|) time, where |f*| is the amount of flow in the maximum flow and m = |E(G)|.

Proof: Observe that the mtdFordFulkerson method performs only subtraction, addition and min operations. Thus, if it finds an augmenting path π, then c_f(π) must be a positive integer. Namely, c_f(π) ≥ 1. Thus, |f*| must be an integer (by induction), and each iteration of the algorithm improves the flow by at least 1. It follows that after |f*| iterations the algorithm stops. Each iteration takes O(m + n) = O(m) time, as can be easily verified.

The following observation is an easy consequence of our discussion.

Observation 14.2.2 (Integrality theorem). If the capacity function c takes on only integral values, then
the maximum flow f produced by the mtdFordFulkerson method has the property that | f | is integer-valued.
Moreover, for all vertices u and v, the value of f (u, v) is also an integer.

14.3. The Edmonds-Karp algorithm


The Edmonds-Karp algorithm works by modifying the mtdFordFulkerson method so that it always returns the
shortest augmenting path in G f (i.e., path with smallest number of edges). This is implemented by finding π
using BFS in G f .

Definition 14.3.1. For a flow f , let δ f (v) be the length of the shortest path from the source s to v in the residual
graph G f . Each edge is considered to be of length 1.

We will shortly prove that, for any vertex v ∈ V \{s, t}, the function δ f (v), in the residual network G f , increases
monotonically with each flow augmentation. We delay proving this (key) technical fact (see Lemma 14.3.5
below), and first show its implications.

Lemma 14.3.2. During the execution of the Edmonds-Karp algorithm, an edge (u, v) might disappear (and thus
reappear) from G f at most n/2 times throughout the execution of the algorithm, where n = |V(G)|.

Proof: Consider an iteration when the edge (u, v) disappears. Clearly, in this iteration the edge (u, v) appeared in the augmenting path π. Furthermore, this edge was fully utilized; namely, c_f(π) = c_f(u, v), where f is the flow at the beginning of the iteration when it disappeared. We continue running Edmonds-Karp till (u, v) "magically" reappears. This means that in the iteration before (u, v) reappeared in the residual graph, the algorithm handled an augmenting path σ that contained the reverse edge (v, u). Let g be the flow used to compute σ. We have, by the monotonicity of δ(·) [i.e., Lemma 14.3.5 below], that
$$\delta_g(u) = \delta_g(v) + 1 \ge \delta_f(v) + 1 = \delta_f(u) + 2,$$
as Edmonds-Karp always augments along the shortest path. Namely, the distance of u from s increased by at least 2 between the disappearance and the reappearance of (u, v). Since initially δ(u) ≥ 0 and the maximum value of δ(u) is n, it follows that (u, v) can disappear and reappear at most n/2 times during the execution of the Edmonds-Karp algorithm.

Figure 14.2: (i) A bipartite graph. (ii) A maximum matching in this graph. (iii) A perfect matching (in a different graph).

Observe that δ(u) might become infinite at some point during the algorithm's execution (i.e., u is no longer reachable from s). If so, by monotonicity, the edge (u, v) would never appear again in the residual graph in any future iteration of the algorithm.

Observation 14.3.3. Every time we add an augmenting path during the execution of the Edmonds-Karp algo-
rithm, at least one edge disappears from the residual graph. Indeed, every edge that realizes the residual capacity
(of the augmenting path) will disappear once we push the maximum possible flow along this path.

Lemma 14.3.4. The Edmonds-Karp algorithm handles at most O(nm) augmenting paths before it stops. Its running time is O(nm²), where n = |V(G)| and m = |E(G)|.

Proof: Every edge might disappear at most n/2 times during the Edmonds-Karp execution, by Lemma 14.3.2. Thus, there are at most nm/2 edge disappearances during the execution of the Edmonds-Karp algorithm. At each iteration, we perform a path augmentation, and at least one edge disappears along it from the residual graph. Thus, the Edmonds-Karp algorithm performs at most O(nm) iterations.
Performing a single iteration of the algorithm boils down to computing an augmenting path. Computing such a path takes O(m) time, as we have to perform BFS to find the augmenting path. It follows that the overall running time of the algorithm is O(nm²).
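The only change relative to the earlier sketch of mtdFordFulkerson is that the augmenting path is found by BFS, so it is always a shortest path in the residual graph. A self-contained sketch (the adjacency-matrix representation and the small example network are illustrative choices):

```python
# Sketch of the Edmonds-Karp algorithm: Ford-Fulkerson where BFS in the
# residual graph always yields a shortest augmenting path.
from collections import deque

def edmonds_karp(n, cap, s, t):
    """cap[u][v] = capacity of edge (u, v); returns the maximum flow value."""
    flow = [[0] * n for _ in range(n)]
    value = 0
    while True:
        parent = [None] * n
        parent[s] = s                      # sentinel so BFS never re-enters s
        q = deque([s])
        while q and parent[t] is None:
            u = q.popleft()
            for v in range(n):
                if parent[v] is None and cap[u][v] - flow[u][v] > 0:
                    parent[v] = u
                    q.append(v)
        if parent[t] is None:              # no augmenting path left: f is maximum
            return value
        # Residual capacity of the path, then push it along the path.
        b, v = float('inf'), t
        while v != s:
            b = min(b, cap[parent[v]][v] - flow[parent[v]][v])
            v = parent[v]
        v = t
        while v != s:
            u = parent[v]
            flow[u][v] += b
            flow[v][u] -= b
            v = u
        value += b

# Small example: s=0, a=1, b=2, t=3; the maximum flow is 17.
cap = [[0, 10, 10, 0],
       [0, 0, 2, 8],
       [0, 0, 0, 9],
       [0, 0, 0, 0]]
mf = edmonds_karp(4, cap, 0, 3)
print(mf)  # 17
```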

We still need to prove the aforementioned monotonicity property. (This is the only part in our discussion
of network flow where the argument gets a bit tedious. So bear with us, after all, you are going to double your
salary here.)

Lemma 14.3.5. If the Edmonds-Karp algorithm is run on a flow network G = (V, E) with source s and sink
t, then for all vertices v ∈ V \ {s, t}, the shortest path distance δ f (v) in the residual network G f increases
monotonically with each flow augmentation.

Proof: Assume, for the sake of contradiction, that this is false. Consider the flow just after the first iteration
when this claim failed. Let f denote the flow before this (fatal) iteration was performed, and let g be the flow
after.
Let v be the vertex such that δg (v) is minimal, among all vertices for which the monotonicity fails. Formally,
this is the vertex v where δg (v) is minimal and

δg (v) < δ f (v). (*)

Let π_g = s → ··· → u → v be the shortest path in G_g from s to v. Clearly, (u, v) ∈ E(G_g), and thus
$$\delta_g(u) = \delta_g(v) - 1.$$

By the choice of v it must be that δg (u) ≥ δ f (u), since otherwise the monotonicity property fails for u, and
u is closer to s than v in Gg , and this, in turn, contradicts our choice of v as being the closest vertex to s that
fails the monotonicity property. There are now two possibilities:

(i) If (u, v) ∈ E(G_f), then
$$\delta_f(v) \le \delta_f(u) + 1 \le \delta_g(u) + 1 = \delta_g(v) - 1 + 1 = \delta_g(v).$$
This contradicts our assumption that δ_f(v) > δ_g(v).


(ii) If (u, v) is not in E(G_f), then the augmenting path σ_f used in computing g from f contains the edge (v, u). Indeed, the edge (u, v) reappeared in the residual graph G_g (while not being present in G_f). The only way this can happen is if the augmenting path σ_f pushed flow in the other direction on the edge (u, v). Namely, (v, u) ∈ σ_f.
However, the algorithm always augments along the shortest path. We thus have that
$$\delta_f(u) = \delta_f(v) + 1 \underbrace{>}_{(*)} \delta_g(v) + 1 > \delta_g(v) = \delta_g(u) + 1,$$
by the definition of u. Thus, δ_f(u) > δ_g(u) (i.e., the monotonicity property fails for u) and δ_g(u) < δ_g(v). A contradiction to the choice of v.

14.4. Applications and extensions for Network Flow


14.4.1. Maximum Bipartite Matching

Definition 14.4.1. For an undirected graph G = (V, E), a matching is a subset of edges M ⊆ E such that for all vertices v ∈ V, at most one edge of M is incident on v.
A maximum matching is a matching M such that for any matching M′ we have |M| ≥ |M′|.
A matching is perfect if it involves all vertices. See Figure 14.2 for examples of these definitions.

Figure 14.3
Theorem 14.4.2. One can compute maximum bipartite matching using network flow in O(nm) time, for a
bipartite graph with n vertices and m edges.

Proof: Given a bipartite graph G, we create a new graph with a new source on the left side and sink on the
right, see Figure 14.3.
Direct all edges from left to right and set the capacity of all edges to 1. Let H be the resulting flow network.
It is now easy to verify that, by the Integrality theorem, a flow in H is either zero or one on every edge, and thus a flow of value k in H is just a collection of k vertex-disjoint paths between s and t in H, which corresponds to a matching in G of size k.
Similarly, given a matching of size k in G, it can be easily interpreted as realizing a flow in H of size k.
Thus, computing a maximum flow in H results in computing a maximum matching in G.
The running time of the algorithm is O(nm), since one has to do at most n/2 augmentations, and each
augmentation takes O(m) time.
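Because every capacity in the construction is one, Ford-Fulkerson on H specializes to the classical augmenting-path algorithm for bipartite matching, with s, t and the residual graph kept implicit. The following Python sketch is this specialization (the adjacency-list representation and names are illustrative choices):

```python
# Sketch of maximum bipartite matching via augmenting paths. This is exactly
# Ford-Fulkerson on the unit-capacity network of Figure 14.3: an augmenting
# path alternates between unmatched and matched edges.

def max_bipartite_matching(adj, n_left, n_right):
    """adj[i] lists the right-side neighbors of left vertex i."""
    match_right = [None] * n_right   # match_right[j] = left vertex matched to j

    def augment(i, seen):
        # Try to find an augmenting path starting at the free left vertex i.
        for j in adj[i]:
            if j not in seen:
                seen.add(j)
                # j is free, or its partner can be rerouted elsewhere.
                if match_right[j] is None or augment(match_right[j], seen):
                    match_right[j] = i
                    return True
        return False

    return sum(augment(i, set()) for i in range(n_left))

# Left vertices {0, 1, 2}, right vertices {0, 1}; a maximum matching has size 2.
m = max_bipartite_matching([[0], [0, 1], [0]], 3, 2)
print(m)  # 2
```

Each of the at most n/2 augmentations scans at most m edges, matching the O(nm) bound of the theorem.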

14.4.2. Extension: Multiple Sources and Sinks


Given a flow network with several sources and sinks, how can we compute maximum flow on such a network?
The idea is to create a super source that sends all its flow to the old sources, and similarly to create a super sink that receives all the flow. See Figure 14.4. Clearly, computing the flow in the two networks is equivalent.

Figure 14.4: (i) A flow network with several sources and sinks, and (ii) an equivalent flow network with a single source and sink.
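The transformation itself is mechanical. A Python sketch (the dictionary edge representation, the vertex names 'S' and 'T', and the use of an unbounded capacity are illustrative choices; a finite bound such as the total capacity of the network would work as well):

```python
# Sketch of the super-source/super-sink transformation of Figure 14.4:
# connect a new source 'S' to every old source, and every old sink to a
# new sink 'T', with unbounded capacity on the new edges.

INF = float('inf')

def add_super_terminals(cap, sources, sinks):
    """cap maps (u, v) -> capacity; returns (new_cap, new_source, new_sink)."""
    new_cap = dict(cap)
    for s in sources:
        new_cap[('S', s)] = INF   # the super source feeds every old source
    for t in sinks:
        new_cap[(t, 'T')] = INF   # every old sink drains into the super sink
    return new_cap, 'S', 'T'

cap = {('s1', 't1'): 5, ('s2', 't2'): 7}
new_cap, S, T = add_super_terminals(cap, ['s1', 's2'], ['t1', 't2'])
print(new_cap[('S', 's1')], new_cap[('t2', 'T')])  # inf inf
```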

Chapter 15

Network Flow III - Applications

15.1. Edge disjoint paths


15.1.1. Edge-disjoint paths in a directed graphs
Question 15.1.1. Given a graph G (either directed or undirected), two vertices s and t, and a parameter k,
the task is to compute k paths from s to t in G, such that they are edge disjoint; namely, these paths do not
share an edge.

To solve this problem, we will convert G (assume G is a directed graph for the time being) into a network
flow graph J, such that every edge has capacity 1. Find the maximum flow in J (between s and t). We claim
that the value of the maximum flow in the network J, is equal to the number of edge disjoint paths in G.

Lemma 15.1.2. If there are k edge disjoint paths in G between s and t, then the maximum flow value in J is
at least k.

Proof: Given k such edge disjoint paths, push one unit of flow along each such path. The resulting flow is legal in J and it has value k.

Definition 15.1.3 (0/1-flow). A flow f is a 0/1-flow if every edge has either no flow on it, or one unit of flow.

Lemma 15.1.4. Let f be a 0/1 flow in a network J with flow value µ. Then there are µ edge disjoint paths
between s and t in J.

Proof: By induction on the number of edges in J that have one unit of flow assigned to them by f. If µ = 0 then there is nothing to prove.
Otherwise, start traversing the graph J from s, traveling only along edges with flow 1 assigned to them by f. We mark such an edge as used, and do not allow one to travel on such an edge again. There are two possibilities:
(i) We reached the target vertex t. In this case, we take this path, add it to the set of output paths, and reduce the flow along the edges of the generated path π to 0. Let J′ be the resulting flow network and f′ the resulting flow. We have |f′| = µ − 1, J′ has fewer edges, and by induction, there are µ − 1 edge disjoint paths between s and t in J′. Together with π this forms µ such paths.
(ii) We visit a vertex v for the second time. In this case, our traversal contains a cycle C of edges in J that have flow 1 on them. We set the flow along the edges of C to 0 and use induction on the remaining graph (since it has fewer edges with flow 1 on them). The value of the flow f did not change by removing C, and as such it follows by induction that there are µ edge disjoint paths between s and t in J.

Since the graph G is simple, there are at most n = |V(J)| edges that leave s. As such, the maximum flow in J is at most n. Thus, applying the Ford-Fulkerson algorithm takes O(mn) time. The extraction of the paths can also be done in linear time by applying the algorithm in the proof of Lemma 15.1.4. As such, we get:

Theorem 15.1.5. Given a directed graph G with n vertices and m edges, and two vertices s and t, one can
compute the maximum number of edge disjoint paths between s and t in G, in O(mn) time.
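The path-extraction argument of Lemma 15.1.4 can be sketched directly in Python: given the set of edges carrying one unit of flow, repeatedly walk from s along used edges, dropping any cycle the walk closes. The set-of-edges representation is an illustrative choice, and the sketch assumes its input is a valid 0/1 flow (so the walk can always continue until it hits t):

```python
# Sketch of Lemma 15.1.4: peel edge-disjoint s-t paths off a 0/1 flow.

def extract_paths(used, s, t):
    """used: set of directed edges carrying one unit of flow.
    Returns a list of edge-disjoint s-t paths (as vertex lists)."""
    out = {}
    for (u, v) in used:
        out.setdefault(u, []).append(v)
    result = []
    while out.get(s):                 # some used edge still leaves s
        walk, u = [s], s
        while u != t:
            v = out[u].pop()          # follow and consume a used edge;
            walk.append(v)            # conservation guarantees one exists
            if v in walk[:-1]:        # closed a cycle: discard it, keep walking
                walk = walk[:walk.index(v) + 1]
            u = v
        result.append(walk)
    return result

used = {('s', 'a'), ('a', 't'), ('s', 'b'), ('b', 't')}
paths = sorted(extract_paths(used, 's', 't'))
print(paths)  # [['s', 'a', 't'], ['s', 'b', 't']]
```

Every used edge is consumed at most once, so the extraction runs in time linear in the number of flow-carrying edges, as claimed above.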

As a consequence we get the following “cute” result.
Theorem 15.1.6 (Menger’s theorem). In a directed graph G with nodes s and t the maximum number of
edge disjoint s − t paths is equal to the minimum number of edges whose removal separates s from t.

Proof: Let U be a collection of edge-disjoint paths from s to t in G. If we remove a set F of edges from G
and separate s from t, then it must be that every path in U uses at least one edge of F. Thus, the number
of edge-disjoint paths is bounded by the number of edges needed to be removed to separate s and t. Namely,
|U| ≤ |F |.
As for the other direction, let F be a set of edges whose removal separates s and t. We claim that the set F forms a cut in G between s and t. Indeed, let S be the set of all vertices in G that are reachable from s without using an edge of F. Clearly, if F is minimal then it must contain exactly the edges of the cut (S, T) (in particular, if F contains some edge which is not in (S, T), we can remove it and get a smaller separating set of edges). In particular, the smallest set F with this separating property has the same size as the minimum cut between s and t in G, which is, by the max-flow min-cut theorem, also the maximum flow in the graph G (where every edge has capacity 1).
But then, by Theorem 15.1.5, there are |F| edge disjoint paths in G (since |F| is the value of the maximum flow).

15.1.2. Edge-disjoint paths in undirected graphs


We would like to solve the s-t disjoint path problem for an undirected graph.
Problem 15.1.7. Given undirected graph G, s and t, find the maximum number of edge-disjoint paths in G
between s and t.

The natural approach is to duplicate every edge in the undirected graph G, and get a (new) directed graph
J. Next, apply the algorithm of Section 15.1.1 to J.
So compute for J the maximum flow f (where every edge has capacity 1). The problem is that the flow f might simultaneously use the two edges (u, v) and (v, u). Observe, however, that in such a case we can remove both edges from the flow f. The resulting flow is legal and has the same value. As such, if we repeatedly remove those "double edges" from the flow f, the resulting flow f′ has the same value. Next, we extract the edge disjoint paths from the graph, and the resulting paths are now edge disjoint in the original graph.
Lemma 15.1.8. There are k edge-disjoint paths in an undirected graph G from s to t if and only if the maximum
value of an s − t flow in the directed version J of G is at least k. Furthermore, the Ford-Fulkerson algorithm can
be used to find the maximum set of disjoint s-t paths in G in O(mn) time.

15.2. Circulations with demands

15.2.1. Circulations with demands

We next modify and extend the network flow problem. Let G = (V, E) be a directed graph with capacities on the edges. Each vertex v has a demand d_v:
• d_v > 0: a sink requiring d_v units of flow into this node.
• d_v < 0: a source with −d_v units of flow leaving it.
• d_v = 0: a regular node.
Let S denote all the source vertices and T denote all the sink/target vertices. For a concrete example of an instance of circulation with demands, see Figure 15.1.

Figure 15.1: Instance of circulation with demands.
Definition 15.2.1. A circulation with demands {d_v} is a function f that assigns nonnegative real values to the edges of G, such that:
• Capacity condition: ∀e ∈ E we have f(e) ≤ c(e).
• Conservation condition: ∀v ∈ V we have f^in(v) − f^out(v) = d_v.
Here, for a vertex v, f^in(v) denotes the flow into v and f^out(v) denotes the flow out of v.

Problem 15.2.2. Is there a circulation that complies with the demand requirements?

See Figure 15.1 and Figure 15.2 for an example.

Figure 15.2: A valid circulation for the instance of Figure 15.1.
Lemma 15.2.3. If there is a feasible circulation with demands {d_v}, then Σ_v d_v = 0.

Proof: Since it is a circulation, we have that d_v = f^in(v) − f^out(v). Summing over all vertices, the flow on every edge is counted twice, once with positive sign and once with negative sign. As such,
$$\sum_v d_v = \sum_v f^{\mathrm{in}}(v) - \sum_v f^{\mathrm{out}}(v) = 0,$$
which implies the claim.

In particular, this implies that there is a feasible solution only if
$$D = \sum_{v : d_v > 0} d_v = \sum_{v : d_v < 0} -d_v.$$

15.2.1.1. The algorithm for computing a circulation

The algorithm performs the following steps:
(A) G = (V, E) - input flow network with demands on vertices.
(B) Check that D = Σ_{v : d_v > 0} d_v = Σ_{v : d_v < 0} −d_v.
(C) Create a new super source s, and connect it to all the vertices v with d_v < 0. Set the capacity of the edge (s, v) to be −d_v.
(D) Create a new super target t. Connect to it all the vertices u with d_u > 0. Set the capacity of the new edge (u, t) to be d_u.
(E) Let J denote the resulting network (which is a standard instance of network flow). Compute a maximum flow in J from s to t. If it is equal to D, then there is a valid circulation, and it is the computed flow restricted to the original graph. Otherwise, there is no valid circulation.
Theorem 15.2.4. There is a feasible circulation with demands {dv } in G if and only if the maximum s-t flow
in J has value D. If all capacities and demands in G are integers, and there is a feasible circulation, then there
is a feasible circulation that is integer valued.
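The reduction in steps (B)-(D) is easy to make concrete. A Python sketch (the dictionary representation and the vertex names 's*' and 't*' are illustrative choices; the returned network would then be handed to any max-flow routine, with feasibility equivalent to the max flow saturating D):

```python
# Sketch of the circulation-with-demands reduction: add a super source 's*'
# feeding the deficit vertices and a super sink 't*' draining the surplus ones.

def circulation_network(cap, demand):
    """cap: {(u, v): capacity}; demand: {v: d_v}.
    Returns (new_cap, D), or (None, None) if the demands do not balance."""
    D = sum(d for d in demand.values() if d > 0)
    if D != -sum(d for d in demand.values() if d < 0):
        return None, None            # total demand must vanish (Lemma 15.2.3)
    new_cap = dict(cap)
    for v, d in demand.items():
        if d < 0:
            new_cap[('s*', v)] = -d  # a source: -d_v units must leave v
        elif d > 0:
            new_cap[(v, 't*')] = d   # a sink: d_v units must enter v
    return new_cap, D

cap = {('a', 'b'): 3}
new_cap, D = circulation_network(cap, {'a': -3, 'b': 3})
print(D, new_cap[('s*', 'a')], new_cap[('b', 't*')])  # 3 3 3
```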

15.3. Circulations with demands and lower bounds


Assume that in addition to specifying capacities and demands on a network G, we also specify for each edge a lower bound on how much flow should be on it. Namely, for every edge e ∈ E(G), we specify ℓ(e) ≤ c(e), which is a lower bound on how much flow must be on this edge. As before, we assume all numbers are integers.
We now need to compute a flow f that fills all the demands on the vertices, such that for any edge e, we have ℓ(e) ≤ f(e) ≤ c(e). The question is how to compute such a flow.
Let us start from the most naive flow, which transfers on every edge exactly its lower bound. This is a valid flow as far as capacities and lower bounds are concerned, but of course, it might violate the demands. Formally, let f_0(e) = ℓ(e), for all e ∈ E(G). Note that f_0 does not necessarily satisfy the conservation rule:
$$L_v = f_0^{\mathrm{in}}(v) - f_0^{\mathrm{out}}(v) = \sum_{e \text{ into } v} \ell(e) - \sum_{e \text{ out of } v} \ell(e).$$

If Lv = dv , then we are happy, since this flow satisfies the required demand. Otherwise, there is imbalance
at v, and we need to fix it.
Formally, we set a new demand d′_v = d_v − L_v for every node v, and the capacity of every edge e to be c′(e) = c(e) − ℓ(e). Let G′ denote the new network with these capacities and demands (note that the lower bounds have "disappeared"). If we can find a circulation f′ in G′ that satisfies the new demands, then clearly the flow f = f_0 + f′ is a legal circulation: it satisfies the demands and the lower bounds.
But finding such a circulation is something we already know how to do, using the algorithm of Theorem 15.2.4. Thus, it follows that we can compute a circulation with lower bounds.

Lemma 15.3.1. There is a feasible circulation in G if and only if there is a feasible circulation in G′.
If all demands, capacities, and lower bounds in G are integers, and there is a feasible circulation, then there is a feasible circulation that is integer valued.

Proof: Let f′ be a circulation in G′. Let f(e) = f_0(e) + f′(e). Clearly, f satisfies the capacity condition in G, and the lower bounds. Furthermore,
$$f^{\mathrm{in}}(v) - f^{\mathrm{out}}(v) = \sum_{e \text{ into } v} \bigl(\ell(e) + f'(e)\bigr) - \sum_{e \text{ out of } v} \bigl(\ell(e) + f'(e)\bigr) = L_v + (d_v - L_v) = d_v.$$
As such, f satisfies the demand conditions on G.
Similarly, let f be a valid circulation in G. Then it is easy to check that f′(e) = f(e) − ℓ(e) is a valid circulation for G′.
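The bookkeeping of this transformation is short enough to sketch directly; the dictionary representation is an illustrative choice. Note how pushing the mandatory ℓ units on an edge (u, v) increases L_v and decreases L_u, so d′_v = d_v − L_v shrinks at v and grows at u:

```python
# Sketch of the lower-bound elimination: route the mandatory flow
# f0(e) = l(e) first, then build the residual instance G' of the text.

def eliminate_lower_bounds(cap, low, demand):
    """cap, low: {(u, v): number}; demand: {v: d_v}.
    Returns (cap', demand') with c'(e) = c(e) - l(e), d'_v = d_v - L_v."""
    new_cap = {e: cap[e] - low.get(e, 0) for e in cap}
    new_dem = dict(demand)
    for (u, v), l in low.items():
        new_dem[u] = new_dem.get(u, 0) + l   # L_u decreases by l, so d'_u grows
        new_dem[v] = new_dem.get(v, 0) - l   # L_v increases by l, so d'_v shrinks
    return new_cap, new_dem

# One edge a -> b with capacity 5 and lower bound 2; a must send 3 units to b.
new_cap, new_dem = eliminate_lower_bounds({('a', 'b'): 5}, {('a', 'b'): 2},
                                          {'a': -3, 'b': 3})
print(new_cap, new_dem)  # {('a', 'b'): 3} {'a': -1, 'b': 1}
```

Here the base flow already ships 2 units, so the residual instance only needs to route 1 more unit, exactly as the adjusted demands say.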

15.4. Applications
15.4.1. Survey design
We would like to design a survey of products used by consumers (i.e., "Consumer i: what did you think of product j?"). The ith consumer agreed in advance to answer a certain number of questions in the range [c_i, c′_i]. Similarly, for each product j we would like to have at least p_j opinions about it, but not more than p′_j. Each consumer can be asked only about a subset of the products which they consumed. In particular, we assume that we know in advance all the products each consumer used, and the above constraints. The question is how to assign questions to consumers, so that we get all the information we want, and every consumer is asked a valid number of questions.
The idea of our solution is to reduce the design of the survey to the problem of computing a circulation in a graph. First, we build a bipartite graph having consumers on one side, and products on the other side. Next, we insert an edge between consumer i and product j if the product was used by this consumer. The capacity of this edge is going to be 1. Intuitively, we are going to compute a flow in this network which is an integer on every edge. As such, every edge would be assigned either 0 or 1, where 1 is interpreted as asking the consumer about this product.

The next step is to connect a source s to all the consumers, where the edge (s, i) has lower bound c_i and upper bound c′_i. Similarly, we connect all the products to the destination t, where (j, t) has lower bound p_j and upper bound p′_j. We would like to compute a flow from s to t in this network that complies with the constraints. However, we only know how to compute a circulation on such a network. To overcome this, we create an edge with infinite capacity between t and s. Now, we are only looking for a valid circulation in the resulting graph G which complies with the aforementioned constraints. See the figure on the right for an example of G.
Given a circulation f in G, it is straightforward to interpret it as a survey design (i.e., all middle edges with flow 1 are questions to be asked in the survey). Similarly, one can verify that a valid survey can be interpreted as a valid circulation in G. Thus, computing a circulation in G indeed solves our problem.
We summarize:

Lemma 15.4.1. Given n consumers and u products with their constraints c_1, c′_1, c_2, c′_2, ..., c_n, c′_n, p_1, p′_1, ..., p_u, p′_u, and a list of length m of which products were used by which consumers, an algorithm can compute a valid survey under these constraints, if such a survey exists, in time O((n + u)m²).
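Building the survey network is pure bookkeeping. A Python sketch (the (lower, upper) edge encoding and the tagged vertex names like ('c', i) are illustrative choices; the result is an instance of circulation with lower bounds, to be solved as in Section 15.3):

```python
# Sketch of the survey network: source -> consumers -> products -> sink,
# closed by an uncapacitated edge (t, s) to turn the flow into a circulation.

def survey_network(used, c_lo, c_hi, p_lo, p_hi):
    """used: set of pairs (i, j), meaning consumer i used product j.
    Returns edges as {(a, b): (lower, upper)}."""
    edges = {('t', 's'): (0, float('inf'))}       # closes the circulation
    for i, (lo, hi) in enumerate(zip(c_lo, c_hi)):
        edges['s', ('c', i)] = (lo, hi)           # questions asked of consumer i
    for j, (lo, hi) in enumerate(zip(p_lo, p_hi)):
        edges[('p', j), 't'] = (lo, hi)           # opinions gathered on product j
    for (i, j) in used:
        edges[('c', i), ('p', j)] = (0, 1)        # at most one question per pair
    return edges

# One consumer who used one product; they answer 1-2 questions, and the
# product needs exactly one opinion.
edges = survey_network({(0, 0)}, [1], [2], [1], [1])
print(edges['s', ('c', 0)], edges[('c', 0), ('p', 0)])  # (1, 2) (0, 1)
```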

Chapter 16

Network Flow IV - Applications II

16.1. Airline Scheduling


Problem 16.1.1. Given information about flights that an airline needs to provide, generate a profitable schedule.

The input is detailed information about the "legs" of flights that the airline needs to serve. We denote this set of flights by F. We would like to find the minimum number of airplanes needed to carry out this schedule. For an example of a possible input, see Figure 16.1 (i).
We can use the same airplane for two segments i and j if the destination of i is the origin of the segment j and there is enough time in between the two flights for the required maintenance. Alternatively, the airplane can fly from dest(i) to origin(j) (assuming that the time constraints are satisfied).

Example 16.1.2. As a concrete example, consider the flights:


(A) Boston (depart 6 A.M.) - Washington D.C. (arrive 7 A.M,).
(B) Washington (depart 8 A.M.) - Los Angeles (arrive 11 A.M.)
(C) Las Vegas (depart 5 P.M.) - Seattle (arrive 6 P.M.)
This schedule can be served by a single airplane by adding the leg "Los Angeles (depart 12 noon) - Las Vegas (arrive 1 P.M.)" to this schedule.

109
2
1: Boston (depart 6 A.M.) - Washington DC (arrive 7 A.M,).
1
2: Urbana (depart 7 A.M.) - Champaign (arrive 8 A.M.)
3: Washington (depart 8 A.M.) - Los Angeles (arrive 11 A.M.) 3
4: Urbana (depart 11 A.M.) - San Francisco (arrive 2 P.M.)
5: San Francisco (depart 2:15 P.M.) - Seattle (arrive 3:15 P.M.)
6
6: Las Vegas (depart 5 P.M.) - Seattle (arrive 6 P.M.).
4
5
(i) (ii)

Figure 16.1: (i) a set F of flights that have to be served, and (ii) the corresponding graph G representing these
flights.

16.1.1. Modeling the problem


The idea is to model the feasibility constraints by a graph. Specifically, G is going to be a directed graph over the flight legs. For i and j, two given flight legs, the edge (i, j) will be present in the graph G if the same airplane can serve both i and j; namely, the same airplane can perform leg i and afterwards serve leg j.
Thus, the graph G is acyclic. Indeed, since we can have an edge (i, j) only if the flight j comes after the flight i (in time), it follows that we cannot have cycles.
We need to decide whether all the required legs can be served using only k airplanes.

16.1.2. Solution
The idea is to perform a reduction of this problem to the computation of circulation. Specifically, we construct
a graph J, as follows:
• For every leg i, we introduce two vertices ui , vi ∈ V(J). We also add a source vertex s and a sink vertex t to J. We set the demand at t to be k, and the demand at s to be −k (i.e., k units of flow are leaving s and need to arrive to t).
• Each flight on the list must be served. This is forced by introducing an edge ei = (ui , vi ), for each leg i. We also set the lower bound on ei to be 1, and the capacity on ei to be 1 (i.e., ℓ(ei ) = 1 and c(ei ) = 1).
• If the same plane can perform flight i and j (i.e., (i, j) ∈ E(G)) then add an edge (vi , u j ) with capacity 1 to J (with no lower bound constraint).
• Since any airplane can start the day with flight i, we add an edge (s, ui ) with capacity 1 to J, for all flights i.
• Similarly, any airplane can end the day by serving the flight j. Thus, we add the edge (v j , t) with capacity 1 to J, for all flights j.
• If we have extra planes, we do not have to use them. As such, we introduce an “overflow” edge (s, t) with capacity k, that can carry all the unneeded airplanes from s directly to t.

Figure 16.2: The resulting graph J for the instance of airline scheduling from Figure 16.1.
Let J denote the resulting graph. See Figure 16.2 for an example.
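As a sanity check, the construction can be sketched in code. The following is a minimal illustration (not from the notes): it builds the edge set of J, with (lower bound, capacity) pairs, for the six flight legs of Figure 16.1. The one-hour repositioning time in `can_follow` and the value k = 2 are assumptions made up for the example.

```python
# Sketch: building the circulation network J from the flight legs of
# Figure 16.1. Edge entries map (u, v) -> (lower bound, capacity).

k = 2  # number of available airplanes (an assumed value for illustration)

# flights[i] = (origin, depart_hour, destination, arrive_hour), 24h clock
flights = {
    1: ("Boston", 6, "Washington", 7),
    2: ("Urbana", 7, "Champaign", 8),
    3: ("Washington", 8, "Los Angeles", 11),
    4: ("Urbana", 11, "San Francisco", 14),
    5: ("San Francisco", 14.25, "Seattle", 15.25),
    6: ("Las Vegas", 17, "Seattle", 18),
}

def can_follow(i, j, transfer=1.0):
    # The same plane can serve leg j after leg i if it can reach j's origin
    # in time; the `transfer` hour for repositioning is an assumption.
    _, _, dest_i, arr_i = flights[i]
    orig_j, dep_j, _, _ = flights[j]
    return arr_i + (0 if dest_i == orig_j else transfer) <= dep_j

edges = {}                                   # (u, v) -> (lower bound, capacity)
for i in flights:
    edges[(f"u{i}", f"v{i}")] = (1, 1)       # leg i must be flown
    edges[("s", f"u{i}")] = (0, 1)           # a plane may start with leg i
    edges[(f"v{i}", "t")] = (0, 1)           # a plane may end with leg i
for i in flights:
    for j in flights:
        if i != j and can_follow(i, j):
            edges[(f"v{i}", f"u{j}")] = (0, 1)   # same plane serves i then j
edges[("s", "t")] = (0, k)                   # overflow edge for unused planes
```

Any feasible circulation solver can then be run on `edges`; in particular, the edge (v3, u6) appears because a plane finishing leg 3 in Los Angeles can reposition to Las Vegas in time for leg 6, exactly as in the example above.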

Lemma 16.1.3. There is a way to perform all flights of F using at most k planes if and only if there is a
feasible circulation in the network J.


Figure 16.3: The (i) input image, and (ii) a possible segmentation of the image.

Proof: Assume there is a way to perform the flights using k′ ≤ k planes. Consider such a feasible schedule.
The schedule of an airplane in this schedule defines a path π in the network J that starts at s and ends at t,
and we send one unit of flow on each such path. We also send k − k′ units of flow on the edge (s, t). Note,
that since the schedule is feasible, all legs are being served by some airplane. As such, all the “middle” edges
with lower-bound 1 are being satisfied. Thus, this results in a valid circulation in J that satisfies all the given
constraints.
As for the other direction, consider a feasible circulation in J. This is an integer valued circulation by the
Integrality theorem. Suppose that k′ units of flow are sent between s and t (ignoring the flow on the edge (s, t)).
All the edges of J (except (s, t)) have capacity 1, and as such the circulation on all other edges is either zero
or one (by the Integrality theorem). We convert this into k′ paths by repeatedly traversing from the vertex s
to the destination t, removing the edges we are using in each such path after extracting it (as we did for the k
disjoint paths problem). Since we never use an edge twice, and J is acyclic, it follows that we would extract k′
paths. Each of those paths corresponds to one airplane, and the overall schedule for the airplanes is valid, since
all required legs are being served (by the lower-bound constraint).

Extensions and limitations. There are a lot of other considerations that we ignored in the above problem:
(i) airplanes have to undergo long term maintenance treatments every once in a while, (ii) one needs to allocate
crew to these flights, (iii) schedules differ between days, and (iv) ultimately we are interested in maximizing revenue
(a much more fluffy concept and much harder to explicitly describe).
In particular, while network flow is used in practice, real world problems are complicated, and network flow
can capture only a few aspects. More than undermining the usefulness of network flow, this emphasizes the
complexity of real-world problems.

16.2. Image Segmentation


In the image segmentation problem, the input is an image, and we would like to partition it into background
and foreground. For an example, see Figure 16.3.

The input is a bitmap on a grid where every grid node represents a pixel. We convert this grid into a directed
graph G, by interpreting every edge of the grid as two directed edges. See the figure on the right for what the
resulting graph looks like.
Specifically, the input for our problem is as follows:
• A bitmap of size N × N, with an associated directed graph G = (V, E).
• For every pixel i, we have a value fi ≥ 0, which is an estimate of the likelihood of this pixel being in the foreground (i.e., the larger fi is, the more probable it is that the pixel is in the foreground).
• For every pixel i, we have (similarly) an estimate bi of the likelihood of pixel i being in the background.
• For every two adjacent pixels i and j we have a separation penalty pi j , which is the “price” of separating i from j. This quantity is defined only for adjacent pixels in the bitmap. (For the sake of simplicity of exposition we assume that pi j = p ji . Note, however, that this assumption is not necessary for our discussion.)

Problem 16.2.1. Given input as above, partition V (the set of pixels) into two disjoint subsets F and B, such
that
$$q(F, B) = \sum_{i \in F} f_i + \sum_{i \in B} b_i - \sum_{(i, j) \in E,\ |F \cap \{i, j\}| = 1} p_{ij}$$
is maximized.

We can rewrite q(F, B) as:
$$q(F, B) = \sum_{i \in F} f_i + \sum_{j \in B} b_j - \sum_{(i, j) \in E,\ |F \cap \{i, j\}| = 1} p_{ij}
= \sum_{i \in V} (f_i + b_i) - \Bigl( \sum_{i \in B} f_i + \sum_{j \in F} b_j + \sum_{(i, j) \in E,\ |F \cap \{i, j\}| = 1} p_{ij} \Bigr).$$
Since the term $\sum_{i \in V} (f_i + b_i)$ is a constant, maximizing q(F, B) is equivalent to minimizing u(F, B), where
$$u(F, B) = \sum_{i \in B} f_i + \sum_{j \in F} b_j + \sum_{(i, j) \in E,\ |F \cap \{i, j\}| = 1} p_{ij}. \tag{16.1}$$

How do we compute this partition? Well, the basic idea is to compute a minimum cut in a graph such that
its price would correspond to u(F, B). Before delving into the exact details, it is useful to play around with
some toy examples to get some intuition. Note, that we are using the max-flow algorithm as an algorithm for
computing minimum directed cut.
To begin with, consider a graph having a source s, a vertex i, and a sink t. We set the
price of (s, i) to be fi and the price of the edge (i, t) to be bi . Clearly, there are two possible
cuts in the graph, either ({s, i} , {t}) (with a price bi ) or ({s} , {i, t}) (with a price fi ). In particular, every path
of length 2 in the graph between s and t forces the algorithm computing the minimum-cut (via network flow)
to choose one of the edges to the cut, where the algorithm “prefers” the edge with lower price.
Next, consider a bitmap with two vertices i and j that are adjacent. Clearly, minimizing
the first two terms in Eq. (16.1) is easy, by generating length two parallel paths between
s and t through i and j. See figure on the right. Clearly, the price of a cut in this graph
is exactly the price of the partition of {i, j} into background and foreground sets. However, this ignores the
separation penalty pi j .

To this end, we introduce two new edges (i, j) and ( j, i) into the graph and set
their price to be pi j . Clearly, a price of a cut in the graph can be interpreted as
the value of u(F, B) of the corresponding sets F and B, since all the edges in the
segmentation from nodes of F to nodes of B are contributing their separation price to the cut price. Thus, if
we extend this idea to the directed graph G, the minimum-cut in the resulting graph would correspond to the
required segmentation.
Let us recap: Given the directed grid graph G = (V, E) we add two special source and sink vertices, denoted
by s and t respectively. Next, for all the pixels i ∈ V, we add an edge ei = (s, i) to the graph, setting its capacity
to be c(ei ) = fi . Similarly, we add the edge ei′ = (i, t) with capacity c(ei′ ) = bi . Similarly, for every pair of vertices
i, j in that grid that are adjacent, we assign the cost pi j to the edges (i, j) and ( j, i). Let J denote the resulting
graph.
The following lemma, follows by the above discussion.

Lemma 16.2.2. A minimum cut (F, B) in J minimizes u(F, B).

Using the minimum-cut max-flow theorem, we have:

Theorem 16.2.3. One can solve the segmentation problem, in polynomial time, by computing the max flow in
the graph J.
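The whole pipeline can be sketched on a toy instance. The following is a hedged illustration, not the notes' algorithm: a generic Edmonds-Karp max-flow routine (any max-flow algorithm would do) applied to a two-pixel bitmap, with made-up values for fi, bi, and pij. The source side of the final residual graph yields the foreground F.

```python
from collections import deque

def max_flow(cap, s, t):
    # Generic Edmonds-Karp: BFS augmenting paths on an adjacency-dict
    # capacity map; returns the flow value and the source side of a min cut.
    flow = 0
    while True:
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in cap.get(u, {}).items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            # vertices reachable from s in the residual graph form a min cut
            return flow, set(parent)
        v, path = t, []
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= aug
            cap.setdefault(v, {}).setdefault(u, 0)
            cap[v][u] += aug
        flow += aug

# Two adjacent pixels: pixel 1 looks like foreground, pixel 2 like background,
# with separation penalty p12 between them (all values are made up).
f = {1: 10, 2: 1}
b = {1: 2, 2: 8}
p12 = 3
cap = {"s": {1: f[1], 2: f[2]},
       1: {"t": b[1], 2: p12},
       2: {"t": b[2], 1: p12}}
cut_value, source_side = max_flow(cap, "s", "t")
F = {i for i in (1, 2) if i in source_side}   # foreground = source side
```

On this instance the minimum cut has value u(F, B) = f2 + b1 + p12 = 6, attained by putting pixel 1 in the foreground and pixel 2 in the background.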

16.3. Project Selection


You have a small company which can carry out some projects out of a set of projects P. Associated with each
project i ∈ P is a revenue pi , where pi > 0 means project i is profitable and pi < 0 means it is a losing project. To
make things interesting, there are dependencies between projects. Namely, one has to complete some “infrastructure”
projects before one is able to do other projects. Formally, you are provided with a graph G = (P, E) such that
(i, j) ∈ E if and only if j is a prerequisite for i.

Definition 16.3.1. A set X ⊆ P is feasible if for all i ∈ X, all the prerequisites of i are also in X. Formally, for
all i ∈ X, with an edge (i, j) ∈ E, we have j ∈ X.
The profit associated with a set of projects X ⊆ P is $\mathrm{profit}(X) = \sum_{i \in X} p_i$.

Problem 16.3.2 (Project Selection Problem). Select a feasible set of projects maximizing the overall profit.

The idea of the solution is to reduce the problem to a minimum-cut in a graph, in a similar fashion to what
we did in the image segmentation problem.

16.3.1. The reduction


The reduction works by adding two vertices s and t to the graph G; we also perform the following modifications:
• For all projects i ∈ P with positive revenue (i.e., pi > 0) add the edge ei = (s, i) to G and set the capacity of the
edge to be c(ei ) = pi , where s is the added source vertex.
• Similarly, for all projects j ∈ P with negative revenue (i.e., p j < 0) add the edge e′j = ( j, t) to G and set
the edge capacity to c(e′j ) = −p j , where t is the added sink vertex.
• Compute a bound on the max flow (and thus also profit) in this network: $C = \sum_{i \in P,\, p_i > 0} p_i$.
• Set the capacity of all other edges in G to 4C (these are the dependency edges in the project, and intuitively
they are too expensive to be “broken” by a cut).
Let J denote the resulting network.
Let X ⊆ P be a set of feasible projects, and let X′ = X ∪ {s} and Y′ = (P \ X) ∪ {t}. Consider the s-t cut
(X′, Y′) in J. Note, that no edge of E(G) is in (X′, Y′) since X is a feasible set (i.e., there is no u ∈ X′ and v ∈ Y′
such that (u, v) ∈ E(G)).

Lemma 16.3.3. The capacity of the cut (X′, Y′), as defined by a feasible project set X, is $c(X', Y') = C - \sum_{i \in X} p_i = C - \mathrm{profit}(X)$.

Proof: The edges of J are either:
(i) original edges of G (conceptually, they have price +∞),
(ii) edges emanating from s, and
(iii) edges entering t.
Since X is feasible, it follows that no edges of type (i) contribute to the cut. The edges entering t contribute to
the cut the value
$$\beta = \sum_{i \in X \text{ and } p_i < 0} -p_i .$$
The edges leaving the source s contribute
$$\gamma = \sum_{i \notin X \text{ and } p_i > 0} p_i = \sum_{i \in P,\, p_i > 0} p_i - \sum_{i \in X \text{ and } p_i > 0} p_i = C - \sum_{i \in X \text{ and } p_i > 0} p_i ,$$
by the definition of C.
The capacity of the cut (X′, Y′) is
$$\beta + \gamma = \sum_{i \in X \text{ and } p_i < 0} (-p_i) + \Bigl( C - \sum_{i \in X \text{ and } p_i > 0} p_i \Bigr) = C - \sum_{i \in X} p_i = C - \mathrm{profit}(X),$$
as claimed.

Lemma 16.3.4. If (X′, Y′) is a cut with capacity at most C in J, then the set X = X′ \ {s} is a feasible set of
projects.
Namely, cuts (X′, Y′) of capacity ≤ C in J correspond one-to-one to feasible sets which are profitable.

Proof: Since c(X′, Y′) ≤ C it must not cut any of the edges of G, since the price of such an edge is 4C. As such,
X must be a feasible set.

Putting everything together, we are looking for a feasible set X that maximizes $\mathrm{profit}(X) = \sum_{i \in X} p_i$. This
corresponds to a set X′ = X ∪ {s} of vertices in J that minimizes $C - \sum_{i \in X} p_i$, which is also the cut capacity
of (X′, Y′). Thus, computing a minimum-cut in J corresponds to computing the most profitable feasible set of
projects.

Theorem 16.3.5. If (X′, Y′) is a minimum cut in J then X = X′ \ {s} is an optimum solution to the project
selection problem. In particular, using network flow the optimal solution can be computed in polynomial time.

Proof: Indeed, we use network flow to compute the minimum cut in the resulting graph J. Note, that it is quite
possible that the most profitable project is still a net loss.
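To make the objective and the identity of Lemma 16.3.3 concrete, here is a brute-force check on a tiny made-up instance (deliberately not the flow reduction, which only pays off for large inputs): it enumerates feasible sets directly and evaluates C − profit(X) for the best one.

```python
from itertools import combinations

# Tiny made-up instance: project 1 yields 5 but requires project 2 (cost 3);
# project 3 yields 2 but requires project 4 (cost 4). Edge (i, j) means
# j is a prerequisite of i, matching the convention of Section 16.3.
p = {1: 5, 2: -3, 3: 2, 4: -4}
E = [(1, 2), (3, 4)]

def feasible(X):
    # every selected project must have all its prerequisites selected
    return all(j in X for (i, j) in E if i in X)

def profit(X):
    return sum(p[i] for i in X)

best = max((X for r in range(len(p) + 1)
            for X in map(frozenset, combinations(p, r)) if feasible(X)),
           key=profit)

C = sum(v for v in p.values() if v > 0)   # the bound used in the reduction
# Lemma 16.3.3: the cut defined by a feasible X has capacity C - profit(X),
# so the minimum cut corresponds to the maximum-profit feasible set.
gap = C - profit(best)
```

Here the best feasible set is {1, 2} with profit 2, so the corresponding minimum cut in J would have capacity C − 2 = 5.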

16.4. Baseball elimination


There is a baseball league taking place and it is nearing the end of the season. One would like to know which
teams are still candidates to winning the season.

Example 16.4.1. There are 4 teams that have the following number of wins:

New York: 92, Baltimore: 91, Toronto: 91, Boston: 90,

and there are 5 games remaining (all pairs except New York and Boston).
We would like to decide if Boston can still win the season. Namely, can Boston finish the season with as
many points as anybody else? (We are assuming here that at every game the winning team gets one point and
the losing team gets nada.¬ )
First analysis. Observe, that Boston can get at most 92 wins. In particular, if New York wins any game
then it is over, since New York would have 93 points.
Thus, for Boston to have any hope it must be that both Baltimore wins against New York and Toronto wins
against New York. At this point in time, both teams have 92 points. But now, they play against each other,
and one of them would get 93 wins. So Boston is eliminated!
Second analysis. As before, Boston can get at most 92 wins. All three other teams get X = 92 + 91 +
91 + (5 − 2) points together by the end of the league. As such, one of these three teams will get ≥ ⌈X/3⌉ = 93
points, and as such Boston is eliminated.

While the analysis of the above example is very cute, it is too tedious to be done each time we want to solve
this problem. Not to mention that it is unclear how to extend these analyses to other cases.

16.4.1. Problem definition


Problem 16.4.2. The input is a set S of teams, where for every team x ∈ S, the team has wx points accumulated
so far. For every pair of teams x, y ∈ S we know that there are gxy games remaining between x and y. Given a
specific team z, we would like to decide if z is eliminated.
Alternatively, is there a way such that z would get as many wins as anybody else by the end of the season?

16.4.2. Solution
First, we can assume that z wins all its remaining games, and let m be the number of points z has in this case.
Our purpose now is to build a network flow so we can check whether there is a scenario in which no other team
gets more than m points.
To this end, let s be the source (i.e., the source of wins). For every remaining game, a flow of one unit would
go from s to one of the teams playing it. Every team can have at most m − wx flow from it to the target. If the
max flow in this network has value
$$\alpha = \sum_{x, y \neq z,\ x < y} g_{xy}$$
(which is the maximum flow possible) then there is a scenario such that all other teams get at most m points
and z can win the season. Negating this statement, we have that if the maximum flow is smaller than α then z
is eliminated, since there must be a team that gets more than m points.

Construction. Let S′ = S \ {z} be the set of teams, and let
$$\alpha = \sum_{\{x, y\} \subseteq S'} g_{xy} . \tag{16.2}$$
We create a network flow G. For every team x ∈ S′ we add a vertex vx to the network G. We also add the
source and sink vertices, s and t, respectively, to G.
For every pair of teams x, y ∈ S′, such that gxy > 0, we create a node u xy ,
and add an edge (s, u xy ) with capacity gxy to G. We also add the edges (u xy , vx )
and (u xy , vy ) with infinite capacity to G. Finally, for each team x we add the
edge (vx , t) with capacity m − wx to G. How the relevant edges look for a
pair of teams x and y is depicted on the right.

¬ nada = nothing.
Analysis. If there is a flow of value α in G then there is a way that all teams get at most m wins. Similarly,
if there exists a scenario such that z ties or gets first place then we can translate this into a flow in G of value
α. This implies the following result.
Theorem 16.4.3. Team z has been eliminated if and only if the maximum flow in G has value strictly smaller
than α. Thus, we can test in polynomial time if z has been eliminated.
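The construction can be sketched on the example of Section 16.4 (four teams, z = Boston). The max-flow routine below is a generic Edmonds-Karp used as a black box, and the constant 10**9 stands in for the infinite capacities; both are implementation choices, not part of the notes.

```python
from collections import deque

def max_flow(cap, s, t):
    # generic Edmonds-Karp on an adjacency-dict capacity map
    flow = 0
    while True:
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in cap.get(u, {}).items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        v, path = t, []
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= aug
            cap.setdefault(v, {}).setdefault(u, 0)
            cap[v][u] += aug
        flow += aug

w = {"NY": 92, "Balt": 91, "Tor": 91}          # wins of the other teams
g = {("NY", "Balt"): 1, ("NY", "Tor"): 1, ("Balt", "Tor"): 1}  # games among them
m = 90 + 2                   # Boston wins both of its remaining games
INF = 10 ** 9                # stands in for infinite capacity

alpha = sum(g.values())
cap = {"s": {}}
for (x, y), gxy in g.items():
    cap["s"][(x, y)] = gxy               # game node u_xy
    cap[(x, y)] = {x: INF, y: INF}       # the game's winner is x or y
for x, wx in w.items():
    cap.setdefault(x, {})["t"] = m - wx  # team x may win at most m - w_x more
eliminated = max_flow(cap, "s", "t") < alpha
```

Here α = 3 but the max flow is only 2 (New York cannot absorb any win), so Boston is correctly reported as eliminated, matching the hand analysis above.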

16.4.3. A compact proof of a team being eliminated


Interestingly, once z is eliminated, we can generate a compact proof of this fact.
Theorem 16.4.4. Suppose that team z has been eliminated. Then there exists a “proof” of this fact of the
following form:
(A) The team z can finish with at most m wins.
(B) There is a set of teams Ŝ ⊆ S so that
$$\sum_{x \in \widehat{S}} w_x + \sum_{\{x, y\} \subseteq \widehat{S}} g_{xy} > m \,\bigl|\widehat{S}\bigr| .$$
(And hence one of the teams in Ŝ must end with strictly more than m wins.)
Proof: If z is eliminated then the max flow in G has value γ, which is smaller than α, see Eq. (16.2). By the
minimum-cut max-flow theorem, there exists a minimum cut (S, T) of capacity γ in G. Let Ŝ = { x | vx ∈ S }.

Claim 16.4.5. For any two teams x and y for which the vertex u xy exists, we have that u xy ∈ S if
and only if both x and y are in Ŝ.
 
Proof: (x ∉ Ŝ or y ∉ Ŝ =⇒ u xy ∉ S): If x is not in Ŝ then vx is in T. But then, if u xy is in S, the
edge (u xy , vx ) is in the cut. However, this edge has infinite capacity, which implies this cut is not a
minimum cut (indeed, cutting all the edges leaving s yields a cut of capacity only α). As such, in such a case
u xy must be in T. This implies that if either x or y is not in Ŝ then it must be that u xy ∈ T. (And
as such u xy ∉ S.)
(x ∈ Ŝ and y ∈ Ŝ =⇒ u xy ∈ S): Assume that both x and y are in Ŝ;
then vx and vy are in S. We need to prove that u xy ∈ S. If u xy ∈ T
then consider the new cut formed by moving u xy to S. For the new
cut (S′, T′) we have
$$c(S', T') = c(S, T) - c\bigl((s, u_{xy})\bigr) .$$
Namely, the cut (S′, T′) has a lower capacity than the minimum cut (S, T), which is a contradiction.
See figure on the right for this impossible cut. We conclude that u xy ∈ S.
The above argumentation implies that edges of the type (u xy , vx ) can not be in the cut (S, T). As such, there
are two types of edges in the cut (S, T): (i) (vx , t), for x ∈ Ŝ, and (ii) (s, u xy ) where at least one of x or y is not in
Ŝ. As such, the capacity of the cut (S, T) is
$$c(S, T) = \sum_{x \in \widehat{S}} (m - w_x) + \sum_{\{x, y\} \nsubseteq \widehat{S}} g_{xy} = m \,\bigl|\widehat{S}\bigr| - \sum_{x \in \widehat{S}} w_x + \Bigl( \alpha - \sum_{\{x, y\} \subseteq \widehat{S}} g_{xy} \Bigr).$$
However, c(S, T) = γ < α, and it follows that
$$m \,\bigl|\widehat{S}\bigr| - \sum_{x \in \widehat{S}} w_x - \sum_{\{x, y\} \subseteq \widehat{S}} g_{xy} < \alpha - \alpha = 0.$$
Namely, $\sum_{x \in \widehat{S}} w_x + \sum_{\{x, y\} \subseteq \widehat{S}} g_{xy} > m \,\bigl|\widehat{S}\bigr|$, as claimed.
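On the running example, verifying such a certificate is just arithmetic. A minimal check (not from the notes), taking Ŝ = {New York, Baltimore, Toronto} and m = 92 for Boston:

```python
# Checking the certificate of Theorem 16.4.4 on the example of Section 16.4.
w = {"NY": 92, "Balt": 91, "Tor": 91}    # wins of the teams in S-hat
games_within = 3            # NY-Balt, NY-Tor, Balt-Tor: one game each
m = 92                      # the most Boston can reach

lhs = sum(w.values()) + games_within     # total wins the trio must end with
eliminated = lhs > m * len(w)            # 277 > 276, so Boston is out
```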

Chapter 17

Network Flow V - Min-cost flow

17.1. Minimum Average Cost Cycle


Let G = (V, E) be a digraph (i.e., a directed graph) with n vertices and m edges, and ω : E → R be a weight
function on the edges. A directed cycle is a closed walk C = (v0, v1, . . . , vt ), where vt = v0 and (vi , vi+1 ) ∈ E, for
i = 0, . . . , t − 1. The average cost of a directed cycle is $\mathrm{AvgCost}(C) = \omega(C)/t = \bigl( \sum_{e \in C} \omega(e) \bigr)/t$.
For each k = 0, 1, . . ., and v ∈ V, let dk (v) denote the minimum length of a walk with exactly k edges, ending
at v (note, that the walk can start anywhere). So, for each v, we have
$$d_0(v) = 0 \qquad \text{and} \qquad d_{k+1}(v) = \min_{e = (u, v) \in E} \bigl( d_k(u) + \omega(e) \bigr).$$
Thus, we can compute di (v), for i = 0, . . . , n and v ∈ V(G), in O(nm) time using dynamic programming.
Let
$$\mathrm{MinAvgCostCycle}(G) = \min_{C \text{ is a cycle in } G} \mathrm{AvgCost}(C)$$
denote the average cost of the minimum average cost cycle in G.

The following theorem is somewhat surprising.

Theorem 17.1.1. The minimum average cost of a directed cycle in G is equal to
$$\alpha = \min_{v \in V} \ \max_{k=0}^{n-1} \frac{d_n(v) - d_k(v)}{n - k} .$$
Namely, α = MinAvgCostCycle(G).

Proof: Note, that adding a quantity r to the weight of every edge of G increases the average cost of a cycle
AvgCost(C) by r. Similarly, α would also increase by r. In particular, we can assume that the price of the
minimum average cost cycle is zero. This implies that now all cycles have non-negative (average) cost.
Thus, from this point on we assume that MinAvgCostCycle(G) = 0, and we prove that α = 0 in this case.
This in turn would imply the theorem – indeed, given a graph where MinAvgCostCycle(G) ≠ 0, we shift the
costs of the edges so that it is zero, use the proof below, and then shift it back.
MinAvgCostCycle(G) = 0 =⇒ α ≥ 0: We can rewrite α as α = min_{u∈V} β(u), where
$$\beta(u) = \max_{k=0}^{n-1} \frac{d_n(u) - d_k(u)}{n - k} .$$
Assume, that α is realized by a vertex v; that is, α = β(v). Let Pn be a walk with n edges ending at v, of
length dn (v). Since there are only n vertices in G, Pn must contain a cycle. So, let us decompose Pn into a
cycle π of length n − k and a path σ of length k (k depends on the length of the cycle in Pn ); see Figure 17.1.
We have that
$$d_n(v) = \omega(P_n) = \omega(\pi) + \omega(\sigma) \geq \omega(\sigma) \geq d_k(v),$$
since ω(π) ≥ 0 as π is a cycle (and we assumed that all cycles have zero or positive cost). As such, we have
dn (v) − dk (v) ≥ 0, and therefore (dn (v) − dk (v))/(n − k) ≥ 0. Thus,
$$\beta(v) = \max_{j=0}^{n-1} \frac{d_n(v) - d_j(v)}{n - j} \geq \frac{d_n(v) - d_k(v)}{n - k} \geq 0.$$
Now, α = β(v) ≥ 0, by the choice of v.

Figure 17.1: Decomposing Pn into a path σ and a cycle π.
MinAvgCostCycle(G) = 0 =⇒ α ≤ 0: Let C = (v0, v1, . . . , vt ) be the directed cycle
of weight 0 in the graph. Observe, that min_{j≥0} d j (v0 ) must be realized (for the first
time) by an index r < n, since if the minimizing walk is longer, we can always shorten it by
removing cycles without increasing its price (since cycles have non-negative price). Let ξ denote
this walk of length r ending at v0 . Let w be a vertex on C reached by walking n − r
edges on C starting from v0 , and let τ denote this walk (i.e., |τ| = n − r). We have
that
$$d_n(w) \leq \omega(\xi \,\|\, \tau) = d_r(v_0) + \omega(\tau), \tag{17.1}$$
where ξ || τ denotes the walk formed by concatenating the walk τ to ξ.
Similarly, let ρ be the walk formed by walking on C from w all the way back to v0 . Note that τ || ρ goes
around C several times, and as such, ω(τ || ρ) = 0, as ω(C) = 0. Next, for any k, since the shortest walk with k
edges arriving to w can be extended to a walk that arrives to v0 , by concatenating ρ to it, we have that
$$d_k(w) + \omega(\rho) \geq d_{k+|\rho|}(v_0) \geq d_r(v_0) \geq d_n(w) - \omega(\tau),$$
by Eq. (17.1). Rearranging, we have that ω(ρ) ≥ dn (w) − ω(τ) − dk (w). Namely, we have
$$\forall k \qquad 0 = \omega(\tau \,\|\, \rho) = \omega(\rho) + \omega(\tau) \geq \bigl( d_n(w) - \omega(\tau) - d_k(w) \bigr) + \omega(\tau) = d_n(w) - d_k(w)$$
$$\implies \ \forall k \quad \frac{d_n(w) - d_k(w)}{n - k} \leq 0 \qquad \implies \qquad \beta(w) = \max_{k=0}^{n-1} \frac{d_n(w) - d_k(w)}{n - k} \leq 0.$$
As such, $\alpha = \min_{v \in V(G)} \beta(v) \leq \beta(w) \leq 0$, and we conclude that α = 0.

Finding the minimum average cost cycle is now not too hard. We compute the vertex v that realizes α in
Theorem 17.1.1. Next, we add −α to the weight of every edge in the graph. We now know that we are looking
for a cycle with price 0. We update the values di (v) to agree with the new weights of the edges.
Now, v is the vertex realizing the quantity $0 = \alpha = \min_{u \in V} \max_{k=0}^{n-1} \frac{d_n(u) - d_k(u)}{n-k}$. Namely, for the
vertex v it holds that
$$\max_{k=0}^{n-1} \frac{d_n(v) - d_k(v)}{n - k} = 0 \ \implies\ \forall k \in \{0, \ldots, n-1\} \quad \frac{d_n(v) - d_k(v)}{n - k} \leq 0 \ \implies\ \forall k \quad d_n(v) - d_k(v) \leq 0.$$
This implies that dn (v) ≤ di (v), for all i. Now, we repeat the proof of Theorem 17.1.1. Let Pn be the walk with
n edges realizing dn (v). We decompose it into a path π of length k and a cycle τ. We know that ω(τ) ≥ 0 (since
all cycles have non-negative weights now). Now, ω(π) ≥ dk (v), as π is a path of length k ending at v. As such,
ω(τ) = dn (v) − ω(π) ≤ dn (v) − dk (v) ≤ 0. Namely, the cycle τ has ω(τ) = 0; it is the required cycle, and
computing it required O(nm) time.
Corollary 17.1.2. Given a direct graph G with n vertices and m edges, and a weight function ω(·) on the edges,
one can compute the cycle with minimum average cost in O(nm) time.
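Theorem 17.1.1 translates directly into the O(nm) dynamic program above (this is Karp's algorithm). The following is a small sketch of it; the example graph is made up (a weight-3 triangle plus a weight −2 two-cycle, so the minimum average cost is −1).

```python
# Sketch of the computation behind Theorem 17.1.1: fill the table d[k][v]
# by dynamic programming, then take
#   alpha = min over v of max over k of (d[n][v] - d[k][v]) / (n - k).

INF = float("inf")

def min_avg_cycle_cost(n, edges):
    # edges: list of (u, v, weight); vertices are 0..n-1
    d = [[0] * n] + [[INF] * n for _ in range(n)]   # d[0][v] = 0: empty walk
    for k in range(1, n + 1):
        for u, v, wt in edges:
            if d[k - 1][u] + wt < d[k][v]:
                d[k][v] = d[k - 1][u] + wt
    best = INF
    for v in range(n):
        if d[n][v] < INF:           # only walks of exactly n edges qualify
            best = min(best, max((d[n][v] - d[k][v]) / (n - k)
                                 for k in range(n) if d[k][v] < INF))
    return best

# Triangle 0->1->2->0 of total weight 3 (average 1), plus a two-cycle
# 1->3->1 of total weight -2 (average -1).
edges = [(0, 1, 1), (1, 2, 1), (2, 0, 1), (1, 3, -3), (3, 1, 1)]
alpha = min_avg_cycle_cost(4, edges)
```

Extracting the cycle itself, as described above, amounts to keeping parent pointers while filling the table and walking back along the walk realizing d_n(v).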

17.2. Potentials
In general, computing shortest paths in a graph that has negative weights is harder than just using the
Dijkstra algorithm (which works only for graphs with non-negative weights on the edges). One can use the
Bellman-Ford algorithm¬ in this case, but it is considerably slower (i.e., it takes O(mn) time). We next present
a case where one can still use the Dijkstra algorithm, with slight modifications.
The following is only required in the analysis of the minimum-cost flow algorithm we present later in this
chapter. We describe it here in full detail since it is simple and interesting.
For a directed graph G = (V, E) with weight w(·) on the edges, let dω (s, t) denote the length of the shortest
path between s and t in G under the weight function w. Note, that w might assign negative weights to edges
in G.
A potential p(·) is a function that assigns a real value to each vertex of G, such that if e = (u, v) ∈ G then
w(e) ≥ p(v) − p(u).

Lemma 17.2.1. (i) There exists a potential p(·) for G if and only if G has no negative cycles (with respect to
w(·)).
(ii) Given a potential function p(·), for an edge e = (u, v) ∈ E(G), let `(e) = w(e) − p(v) + p(u). Then `(·) is
non-negative for the edges in the graph and for any pair of vertices s, t ∈ V(G), we have that the shortest path π
realizing d` (s, t) also realizes dω (s, t).
(iii) Given G and a potential function p(·), one can compute the shortest path from s to all the vertices of
G in O(n log n + m) time, where G has n vertices and m edges.

Proof: (i) Consider a cycle C, and assume there is a potential p(·) for G. Observe that
$$w(C) = \sum_{(u,v) \in E(C)} w(u, v) \geq \sum_{(u,v) \in E(C)} \bigl( p(v) - p(u) \bigr) = 0,$$
as required.
For the other direction, for a vertex v ∈ V(G), let p(v) denote the length of the shortest walk that ends at v
in G. We claim that p(·) is a potential. Since G does not have negative cycles, the quantity p(v) is well defined.
Observe that p(v) ≤ p(u) + w(u → v), since we can always continue a walk ending at u into v by traversing
(u, v). Thus, p(v) − p(u) ≤ w(u → v), as required.
(ii) Since ℓ(e) = w(e) − p(v) + p(u) and w(e) ≥ p(v) − p(u) (as p(·) is a potential function), we have
w(e) − p(v) + p(u) ≥ 0, as required.
As for the other claim, observe that for any path π in G starting at s and ending at t we have that
$$\ell(\pi) = \sum_{e=(u,v) \in \pi} \bigl( w(e) - p(v) + p(u) \bigr) = w(\pi) + p(s) - p(t),$$
which implies that dℓ (s, t) = dω (s, t) + p(s) − p(t), implying the claim.
(iii) Just use the Dijkstra algorithm on the distances defined by ℓ(·). The shortest paths are preserved under
this distance by (ii), and this distance function is always non-negative.

17.3. Minimum cost flow


Given a network flow G = (V, E) with source s and sink t, capacities c(·) on the edges, a real number φ, and a
cost function κ(·) on the edges, the cost of a flow f is defined to be
$$\mathrm{cost}(f) = \sum_{e \in E} \kappa(e) \cdot f(e).$$
The minimum-cost s-t flow problem asks to find the flow f that minimizes the cost and has value φ.
¬ http://en.wikipedia.org/wiki/Bellman-Ford_algorithm

It will be easier to look at the minimum-cost circulation problem. Here, instead of φ, we are given
a lower bound ℓ(·) on the flow on every edge (and the regular upper bound c(·) on the capacities of
the edges). All the flow coming into a node must leave this node. It is easy to verify that if we can solve the
minimum-cost circulation problem, then we can solve the min-cost flow problem. Thus, we will concentrate on
the min-cost circulation problem.
An important technicality is that all the circulations we discuss here have zero demands on the vertices. As
such, a circulation can be conceptually considered to be a flow going around in cycles in the graph without ever
stopping. In particular, for these circulations, the conservation of flow property should hold for all the vertices
in the graph.
The residual graph of f is the graph Gf = (V, Ef ) where
$$E_f = \Bigl\{ e = (u, v) \in V \times V \ \Bigm|\ f(e) < c(e) \ \text{ or } \ f\bigl(e^{-1}\bigr) > \ell\bigl(e^{-1}\bigr) \Bigr\},$$
where e−1 = (v, u) if e = (u, v). Note, that the definition of the residual network takes into account the lower-
bound on the capacity of the edges.
Assumption 17.3.1. To simplify the exposition, we will assume that if (u, v) ∈ E(G) then (v, u) < E(G), for
all u, v ∈ V(G). This can be easily enforced by introducing a vertex in the middle of every edge of G. This is
acceptable, since we are more concerned with solving the problem at hand in polynomial time, than the exact
complexity. Note, that our discussion can be extended to handle the slightly more general case, with a bit of
care.

We extend the cost function to be anti-symmetric; namely,
$$\forall (u, v) \in E_f \qquad \kappa\bigl((u, v)\bigr) = -\kappa\bigl((v, u)\bigr).$$
Consider a directed cycle C in Gf . For an edge e = (u, v) ∈ E, we define
$$\chi_C(e) = \begin{cases} \hphantom{-}1 & e \in C \\ -1 & e^{-1} = (v, u) \in C \\ \hphantom{-}0 & \text{otherwise;} \end{cases}$$
that is, we pay 1 if e is in C and −1 if we travel e in the “wrong” direction.
The cost of a directed cycle C in Gf is defined as
$$\kappa(C) = \sum_{e \in C} \kappa(e).$$

We will refer to a circulation that complies with the capacity and lower-bound constraints as being valid.
A function that just complies with the conservation property (i.e., all incoming flow into a vertex leaves it) is
a weak circulation. In particular, a weak circulation might not comply with the capacity and lower bound
constraints of the given instance, and as such is not necessarily a valid circulation.
We need the following easy technical lemmas.
Lemma 17.3.2. Let f and g be two valid circulations in G = (V, E). Consider the function h = g − f. Then, h
is a weak circulation, and if h(u → v) > 0 then the edge (u, v) ∈ Gf .

Proof: The fact that h is a circulation is trivial, as it is the difference between two circulations, and as such the
same amount of flow that comes into a vertex leaves it, and thus it is a circulation. (Note, that h might not be
a valid circulation, since it might not comply with the lower-bounds on the edges.)
Observe, that if h(u → v) is negative, then h(v → u) = −h(u → v) by the anti-symmetry of f and g, which
implies the same property holds for h.
Consider an arbitrary edge e = (u, v) such that h(u → v) > 0.

There are two possibilities. First, if e = (u, v) ∈ E, and f(e) < c(e), then the claim trivially holds, since
then e ∈ Gf . Thus, consider the case when f(e) = c(e); but then h(e) = g(e) − f(e) ≤ 0, which contradicts our
assumption that h(u → v) > 0.
The second possibility is that e = (u, v) ∉ E. But then e−1 = (v, u) must be in E, and it holds that
$$0 > h\bigl(e^{-1}\bigr) = g\bigl(e^{-1}\bigr) - f\bigl(e^{-1}\bigr),$$
implying that $f(e^{-1}) > g(e^{-1}) \geq \ell(e^{-1})$. Namely, the flow f sends along e−1 more than the lower bound.
Since we can return this flow in the other direction, it must be that e ∈ Gf .

Lemma 17.3.3. Let f be a circulation in a graph G. Then, f can be decomposed into at most m cycles,
C1, . . . , Ct , such that, for any e ∈ E(G), we have
$$f(e) = \sum_{i=1}^{t} \lambda_i \cdot \chi_{C_i}(e),$$
where λ1, . . . , λt > 0 and t ≤ m, where m is the number of edges in G.

Proof: Since f is a circulation, and the amount of flow into a node is equal to the amount of flow leaving the
node, it follows that as long as f is not zero, one can find a cycle in f. Indeed, start with a vertex which has a
non-zero amount of flow into it, and walk on an adjacent edge that has positive flow on it. Repeat this process
till you visit a vertex that was already visited. Now, extract the cycle contained in this walk.
Let C1 be such a cycle, and observe that every edge of C1 has positive flow on it, let λ1 be the smallest
amount of flow on any edge of C1 , and let e1 denote this edge. Consider the new flow g = f − λ1 · χC1 . Clearly, g
has zero flow on e1 , and it is a circulation. Thus, we can remove e1 from G, and let J denote the new graph. By
induction, applied to g on J, the flow g can be decomposed into m − 1 cycles with positive coefficients. Putting
these cycles together with λ1 and C1 implies the claim.

Theorem 17.3.4. A flow f is a minimum cost feasible circulation if and only if each directed cycle of Gf has
nonnegative cost.

Proof: Let C be a negative cost cycle in Gf . Then, we can circulate more flow on C and get a flow with smaller
price. In particular, let ε > 0 be a sufficiently small constant, such that g = f + ε ∗ χC is still a feasible circulation
(observe, that since the edges of C are Gf , all of them have residual capacity that can be used to this end).
Now, we have that
$$\mathrm{cost}(g) = \mathrm{cost}(f) + \sum_{e \in C} \kappa(e) \cdot \varepsilon = \mathrm{cost}(f) + \varepsilon \sum_{e \in C} \kappa(e) = \mathrm{cost}(f) + \varepsilon \cdot \kappa(C) < \mathrm{cost}(f),$$
since κ(C) < 0, which is a contradiction to the minimality of f.
since κ(C) < 0, which is a contradiction to the minimality of f.


As for the other direction, assume that all the cycles in Gf have non-negative cost. Then, let g be any feasible circulation. Consider the circulation h = g − f. By Lemma 17.3.2, all the edges used by h are in Gf, and by Lemma 17.3.3 we can find t ≤ |E(Gf)| cycles C1, . . . , Ct in Gf, and coefficients λ1, . . . , λt, such that
$$h(e) = \sum_{i=1}^{t} \lambda_i \chi_{C_i}(e).$$

We have that
$$\mathrm{cost}(g) - \mathrm{cost}(f) = \mathrm{cost}(h) = \mathrm{cost}\Bigl(\sum_{i=1}^{t} \lambda_i \chi_{C_i}\Bigr) = \sum_{i=1}^{t} \lambda_i\, \mathrm{cost}\bigl(\chi_{C_i}\bigr) = \sum_{i=1}^{t} \lambda_i \kappa(C_i) \ge 0,$$

as κ(Ci ) ≥ 0, since there are no negative cycles in Gf . This implies that cost(g) ≥ cost(f). Namely, f is a
minimum-cost circulation.

17.4. Strongly Polynomial Time Algorithm for Min-Cost Flow
The algorithm starts from a feasible circulation f. We know how to compute such a flow f using the standard max-flow algorithm. At each iteration, it finds the minimum average cost cycle C in Gf (using the algorithm of Section 17.1). If the cost of C is non-negative, we are done, since we have arrived at the minimum cost circulation, by Theorem 17.3.4.
Otherwise, we circulate as much flow as possible along C (without violating the lower-bound constraints and capacity constraints), and reduce the price of the flow f. By Corollary 17.1.2, we can compute such a cycle in O(mn) time. Since the cost of the flow is monotonically decreasing, the algorithm terminates if all the numbers involved are integers, but a priori the number of iterations might be huge. We will show that this algorithm performs a number of iterations that is polynomial in n and m.
It is striking how simple this algorithm is, and the fact that it works in polynomial time. The analysis is somewhat more painful.
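Putting Theorem 17.3.4 and the iteration above together, the cycle-canceling loop can be sketched as follows. This is only an illustration: the dict-based graph encoding is a choice made here, the graph is assumed to have no antiparallel edges, and the exhaustive cycle search below stands in for Karp's O(mn) minimum mean cycle algorithm of Section 17.1.

```python
def min_mean_cycle(nodes, w):
    """Exhaustively find a minimum average-cost directed cycle.
    `w` maps directed edges (u, v) to costs; this is an exponential-time
    stand-in for Karp's algorithm, meant for tiny examples only."""
    best, best_avg = None, None

    def dfs(path):
        nonlocal best, best_avg
        for (x, v), _c in w.items():
            if x != path[-1]:
                continue
            if v == path[0]:                    # closed a simple cycle
                cyc = path + [v]
                total = sum(w[(cyc[j], cyc[j + 1])] for j in range(len(cyc) - 1))
                avg = total / (len(cyc) - 1)
                if best_avg is None or avg < best_avg:
                    best, best_avg = cyc, avg
            elif v not in path:
                dfs(path + [v])

    for s in nodes:
        dfs([s])
    return best, best_avg


def min_cost_circulation(nodes, cap, low, cost, f):
    """Cycle canceling: repeatedly cancel a minimum mean cycle of G_f.
    `f` must be a feasible circulation (low[e] <= f[e] <= cap[e], flow
    conserving); computing one via max flow is covered in earlier chapters."""
    while True:
        res = {}                                # residual edge -> (capacity, cost)
        for e in cap:
            u, v = e
            if f[e] < cap[e]:
                res[(u, v)] = (cap[e] - f[e], cost[e])
            if f[e] > low[e]:
                res[(v, u)] = (f[e] - low[e], -cost[e])
        cyc, avg = min_mean_cycle(nodes, {e: c for e, (_r, c) in res.items()})
        if cyc is None or avg >= 0:             # no negative cycle left:
            return f                            # f is optimal, by Theorem 17.3.4
        delta = min(res[(cyc[j], cyc[j + 1])][0] for j in range(len(cyc) - 1))
        for j in range(len(cyc) - 1):
            u, v = cyc[j], cyc[j + 1]
            if (u, v) in cap:
                f[(u, v)] += delta              # push on a forward residual edge
            else:
                f[(v, u)] -= delta              # cancel on a backward residual edge
```

With integer data the loop terminates because the cost strictly decreases; the analysis below bounds the number of iterations without the integrality assumption.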

17.5. Analysis of the Algorithm

To analyze the above algorithm, let fi be the flow in the beginning of the ith iteration, and let Ci be the cycle used in the ith iteration. For a flow f, let Cf be the minimum average cost cycle of Gf, and let µ(f) = κ(Cf)/|Cf| denote the average “cost” per edge of Cf.

Figure 17.2 (Notation used):
  f, g, h, i : flows or circulations
  Gf : the residual graph for f
  c(e) : the capacity of the flow on e
  ℓ(e) : the lower-bound (i.e., demand) on the flow on e
  cost(f) : the overall cost of the flow f
  κ(e) : the cost of sending one unit of flow on e
  ψ(e) : the reduced cost of e

The following lemma states that we are making “progress” in each iteration of the algorithm.

Lemma 17.5.1. Let f be a flow, and let g be the flow resulting from applying the cycle C = Cf to it. Then, µ(g) ≥ µ(f).

Proof: Assume, for the sake of contradiction, that µ(g) < µ(f). Namely, we have
$$\frac{\kappa(C_g)}{|C_g|} < \frac{\kappa(C_f)}{|C_f|}. \qquad (17.2)$$

Now, the only difference between Gf and Gg are the edges of Cf. In particular, some edges of Cf might disappear from Gg, as they are being used in g to their full capacity. Also, all the edges in the opposite direction to Cf will be present in Gg.
Now, Cg must use at least one of the new edges in Gg, since otherwise this would contradict the minimality of Cf (i.e., we could use Cg in Gf and get a cheaper average cost cycle than Cf). Let U be the set of new edges of Gg that are being used by Cg and are not present in Gf. Let U⁻¹ = { e⁻¹ | e ∈ U }. Clearly, all the edges of U⁻¹ appear in Cf.
Now, consider the cycle π = Cf ∪ Cg. We have that the average cost of π is
$$\alpha = \frac{\kappa(C_f) + \kappa(C_g)}{|C_f| + |C_g|} < \max\!\left(\frac{\kappa(C_g)}{|C_g|}, \frac{\kappa(C_f)}{|C_f|}\right) = \mu(f),$$

by Eq. (17.2). We can write π as a union of k edge-disjoint cycles σ1, . . . , σk and some 2-cycles. A 2-cycle is formed by a pair of edges e and e⁻¹, where e ∈ U and e⁻¹ ∈ U⁻¹. Clearly, the cost of these 2-cycles is zero. Thus,

since the cycles σ1, . . . , σk have no edges in U, it follows that they are all contained in Gf. We have
$$\kappa(C_f) + \kappa(C_g) = \sum_{i} \kappa(\sigma_i) + 0.$$

Thus, there is some non-negative integer constant c, such that
$$\alpha = \frac{\kappa(C_f) + \kappa(C_g)}{|C_f| + |C_g|} = \frac{\sum_i \kappa(\sigma_i)}{c + \sum_i |\sigma_i|} \ge \frac{\sum_i \kappa(\sigma_i)}{\sum_i |\sigma_i|},$$
since α is negative (indeed, α < µ(f) < 0, as otherwise the algorithm would have already terminated). Namely, µ(f) > (Σ_i κ(σ_i))/(Σ_i |σ_i|), which implies that there is a cycle σr, such that µ(f) > κ(σr)/|σr|, and this cycle is contained in Gf. But this is a contradiction to the minimality of µ(f).

17.5.1. Reduced cost induced by a circulation


Conceptually, consider the function µ(f) to be a potential function that increases as the algorithm progresses.
To make further progress in our analysis, it would be convenient to consider a reweighting of the edges of G, in
such a way that preserves the weights of cycles.
Given a circulation f, we are going to define a different cost function on the edges, which is induced by f. To begin with, let β(u → v) = κ(u → v) − µ(f). Note that under the cost function β, the cheapest cycle of G has cost 0 (since, under β, the average cost of an edge in the minimum average cost cycle is zero). Namely, G has no negative cycles under β. Thus, for every vertex v ∈ V(G), let d(v) denote the length (under β) of the shortest walk that ends at v.
The function d(v) is a potential in G, by Lemma 17.2.1, and as such

d(v) − d(u) ≤ β(u → v) = κ(u → v) − µ(f). (17.3)

Next, let the reduced cost of (u, v) (in relation to f) be

ψ(u → v) = κ(u → v) + d(u) − d(v).

In particular, Eq. (17.3) implies that

∀(u, v) ∈ E(Gf ) ψ(u → v) = κ(u → v) + d(u) − d(v) ≥ µ(f). (17.4)

Namely, the reduced cost of any edge (u, v) is at least µ(f).


Note that ψ(v → u) = κ(v → u) + d(v) − d(u) = −κ(u → v) + d(v) − d(u) = −ψ(u → v) (i.e., it is anti-symmetric).
Also, for any cycle C in G, we have that κ(C) = ψ(C), since the contribution of the potential d(·) cancels out.
The idea is that now we think about the algorithm as running with the reduced cost instead of the regular
costs. Since the costs of cycles under the original cost and the reduced costs are the same, negative cycles are
negative in both costs. The advantage is that the reduced cost is more useful for our purposes.
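The potentials d(v) used here are shortest-walk distances under the shifted weights β(u → v) = κ(u → v) − µ(f), so Bellman-Ford, started with every vertex as a zero-distance source, computes them. A sketch under the assumptions of the text (µ is the minimum cycle mean, so no negative β-cycle exists); the dict encoding of κ is a choice made here:

```python
def potentials_and_reduced_costs(nodes, kappa, mu):
    """Compute d(v), the length of the shortest walk ending at v under the
    weights beta(e) = kappa(e) - mu, and the induced reduced costs psi.
    Assumes mu is the minimum cycle mean, so no negative beta-cycle exists.
    Plain Bellman-Ford with every vertex starting at distance zero."""
    d = {v: 0 for v in nodes}
    for _ in range(len(nodes) - 1):             # n - 1 relaxation rounds
        for (u, v), k in kappa.items():
            if d[u] + k - mu < d[v]:
                d[v] = d[u] + k - mu
    # Reduced cost of each edge: psi(u -> v) = kappa(u -> v) + d(u) - d(v).
    psi = {(u, v): k + d[u] - d[v] for (u, v), k in kappa.items()}
    return d, psi
```

By construction every reduced cost satisfies ψ(e) ≥ µ (Eq. (17.4)), and the cost of any cycle is unchanged, since the potentials telescope.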

17.5.2. Bounding the number of iterations


Lemma 17.5.2. Let f be a flow used in the ith iteration of the algorithm, let g be the flow used in the (i + m)th
iteration, where m is the number of edges in G. Furthermore, assume that the algorithm performed at least one
more iteration on g. Then, µ(g) ≥ (1 − 1/n)µ(f).

Proof: Let C0, . . . , Cm−1 be the m cycles used in computing g from f. Let ψ(·) be the reduced cost function
induced by f.
If a cycle has only negative reduced cost edges, then after it is applied to the flow, one of these edges disappears from the residual graph, and the reverse edge (with positive reduced cost) appears in the residual graph. As such, if all the edges of these cycles had negative reduced costs, then Gg would have no negative reduced cost edge, and as such µ(g) ≥ 0. But the algorithm stops as soon as the minimum average cost cycle becomes non-negative – a contradiction to our assumption that the algorithm performs at least another iteration.
Let Ch be the first cycle in this sequence that contains an edge e′ with non-negative reduced cost; that is, ψ(e′) ≥ 0. Note that Ch has at most n edges. We have that
$$\kappa(C_h) = \psi(C_h) = \sum_{e \in C_h} \psi(e) = \psi(e') + \sum_{e \in C_h,\, e \ne e'} \psi(e) \ge 0 + (|C_h| - 1)\,\mu(f),$$

by Eq. (17.4). Namely, the average cost of Ch is
$$0 > \mu(f_h) = \frac{\kappa(C_h)}{|C_h|} \ge \frac{|C_h| - 1}{|C_h|}\,\mu(f) \ge \left(1 - \frac{1}{n}\right)\mu(f).$$
The claim now easily follows from Lemma 17.5.1.

To bound the running time of the algorithm, we will argue that after sufficient number of iterations edges
start disappearing from the residual network and never show up again in the residual network. Since there are
only 2m possible edges, this would imply the termination of the algorithm.
Observation 17.5.3. We have that (1 − 1/n)^n ≤ (exp(−1/n))^n = 1/e, since 1 − x ≤ e^{−x} for all x ≥ 0, as can be easily verified.

Lemma 17.5.4. Let f be the circulation maintained by the algorithm at iteration ρ. Then there exists an edge e in the residual network Gf such that it never appears in the residual networks of the circulations maintained by the algorithm in iterations later than ρ + t, where t = 2nm⌈ln n⌉.

Proof: Let g be the flow used by the algorithm at iteration ρ + t. We define the reduced cost over the edges of
G, as induced by the flow g. Namely,

ψ(u → v) = κ(u → v) + d(u) − d(v),

where d(u) is the length of the shortest walk ending at u where the weight of edge (u, w) is κ(u → w) − µ(g).
Now, conceptually, we are running the algorithm using this reduced cost function over the edges. Consider the minimum average cost cycle at iteration ρ, with cost α = µ(f). There must be an edge e ∈ E(Gf) such that ψ(e) ≤ α. (Note that α is a negative quantity, as otherwise the algorithm would have terminated at iteration ρ. Throughout, f, g, and h denote the flows at iterations ρ, ρ + t, and ρ + t + τ, respectively.)
We have that, at iteration ρ + t, it holds that
$$\mu(g) \ge \alpha\left(1 - \frac{1}{n}\right)^{t} \ge \alpha\,\exp\bigl(-2m\lceil \ln n \rceil\bigr) \ge \frac{\alpha}{2n}, \qquad (17.5)$$
by Lemma 17.5.2 and Observation 17.5.3, and since α < 0. On the other hand, by Eq. (17.4), we know that every edge e′ ∈ E(Gg) has ψ(e′) ≥ µ(g) ≥ α/2n. As such, e cannot be an edge of Gg, since ψ(e) ≤ α < α/2n. Namely, it must be that g(e) = c(e).


So, assume that at a later iteration, say ρ + t + τ, the edge e reappeared in the residual graph. Let h be the
flow at the (ρ + t + τ)th iteration, and let Gh be the residual graph. It must be that h(e) < c(e) = g(e).
Now, consider the circulation i = g − h. It has a positive flow on the edge e, since i(e) = g(e) − h(e) > 0. In
particular, there is a directed cycle C of positive flow of i in Gi that includes e, as implied by Lemma 17.3.3.
Note, that Lemma 17.3.2 implies that C is also a cycle of Gh .
Now, the edges of C⁻¹ are present in Gg. To see that, observe that for every edge e′ ∈ C, we have that 0 < i(e′) = g(e′) − h(e′) ≤ g(e′) − ℓ(e′). Namely, g(e′) > ℓ(e′), and as such (e′)⁻¹ ∈ E(Gg). As such, by Eq. (17.4), we have ψ((e′)⁻¹) ≥ µ(g). This implies
$$\forall e' \in C \qquad \psi(e') = -\psi\bigl((e')^{-1}\bigr) \le -\mu(g) \le -\frac{\alpha}{2n},$$

by Eq. (17.5). Since C is a cycle of Gh, we have
$$\kappa(C) = \psi(C) = \psi(e) + \psi(C \setminus \{e\}) \le \alpha + (|C| - 1)\cdot\Bigl(-\frac{\alpha}{2n}\Bigr) < \frac{\alpha}{2}.$$
Namely, the average cost of the cycle C, which is present in Gh, is κ(C)/|C| < α/(2n).
On the other hand, the minimum average cost cycle in Gh has average price µ(h) ≥ µ(g) ≥ α/(2n), by Lemma 17.5.1. A contradiction, since we found a cycle C in Gh which is cheaper.

We are now ready for the “kill” – since one edge disappears forever every O(mn log n) iterations, it follows that
after O(m2 n log n) iterations the algorithm terminates. Every iteration takes O(mn) time, by Corollary 17.1.2.
Putting everything together, we get the following.

Theorem 17.5.5. Given a digraph G with n vertices and m edges, lower bound and upper bound on the flow of
each edge, and a cost associated with each edge, then one can compute a valid circulation of minimum-cost in
O(m3 n2 log n) time.

17.6. Bibliographical Notes


The minimum average cost cycle algorithm, of Section 17.1, is due to Karp [Kar78].
The description here follows very roughly the description of [Sch04]. The first strongly polynomial time
algorithm for minimum-cost circulation is due to Éva Tardos [Tar85]. The algorithm we show is an improved
version due to Andrew Goldberg and Robert Tarjan [GT89]. Initial research on this problem can be traced
back to the 1940s, so it took almost fifty years to find a satisfactory solution to this problem.

Chapter 18

Network Flow VI - Min-Cost Flow Applications

18.1. Efficient Flow

A flow f is considered to be efficient if it contains no cycles in it. Surprisingly, even the Ford-Fulkerson algorithm might generate flows with cycles in them. As a concrete example, consider the graph depicted on the right (vertices s, u, v, w, and t; a disc in the middle of an edge indicates that we split the edge into multiple edges by introducing a vertex at this point; all edges have capacity one). For this graph, Ford-Fulkerson would first augment along s → w → u → t. Next, it would augment along s → u → v → t, and finally it would augment along s → v → w → t. But now, there is a cycle in the flow; namely, u → v → w → u.
One easy way to avoid such cycles is to first compute the max flow in G. Let α be the value of this flow. Next, we compute the min-cost flow in this network from s to t with flow value α, where every edge has cost one. Clearly, the flow computed by the min-cost flow would not contain any such cycles. If it did contain cycles, then we could remove them by pushing flow against the cycle (i.e., reducing the flow along the cycle), resulting in a cheaper flow with the same value, which would be a contradiction. We get the following result.

Theorem 18.1.1. Computing an efficient (i.e., acyclic) max-flow can be done in polynomial time.

(BTW, this can also be achieved by removing the cycles directly from the flow. Naturally, this flow might be less efficient than the min-cost flow computed.)

18.2. Efficient Flow with Lower Bounds


Consider the problem AFWLB (acyclic flow with lower-bounds) of computing an efficient flow, where we have lower bounds on the edges. Here, we require that the returned flow be integral if all the numbers involved are integers. Surprisingly, this problem, which looks very similar to the problems we know how to solve efficiently, is NP-Hard. Indeed, consider the following problem.

Hamiltonian Path
Instance: A directed graph G and two vertices s and t.
Question: Is there a Hamiltonian path (i.e., a path visiting every vertex exactly once) in G starting
at s and ending at t?

It is easy to verify that Hamiltonian Path is NP-Complete¬. We reduce this problem to AFWLB by replacing each vertex of G with two vertices and a directed edge between them (except for the source vertex s and the sink vertex t). We set the lower-bound and capacity of each such edge to 1. Let J denote the resulting graph.
Consider now an integral acyclic flow in J of value 1 from s to t. It is a 0/1-flow, and as such it defines a path that visits all the special edges we created. In particular, it corresponds to a path in the original graph that starts at s, visits all the vertices of G, and ends up at t. Namely, if we can compute an integral acyclic flow with lower-bounds in J in polynomial time, then we can solve Hamiltonian Path in polynomial time. Thus, AFWLB is NP-Hard.
¬ Verify that you know how to do this – it's a natural question for the exam.

Theorem 18.2.1. Computing an efficient (i.e., acyclic) max-flow with lower-bounds is NP-Hard (where the flow must be integral). The related decision problem (of whether such a flow exists) is NP-Complete.

By this point you might be as confused as I am. We can model an acyclic max-flow problem with lower bounds as min-cost flow, and solve it, no? Well, not quite. The solution returned by the min-cost flow algorithm might have cycles, and we cannot remove them by canceling the cycles. That was only possible when there were no lower bounds on the edge capacities. Namely, the min-cost flow algorithm might return a solution with cycles in it if there are lower bounds on the edges.

18.3. Shortest Edge-Disjoint Paths


Let G be a directed graph. We would like to compute k edge-disjoint paths between vertices s and t in the graph. We know how to do this using network flow. Interestingly, we can find the shortest k edge-disjoint paths using min-cost flow. Here, we assign cost 1 and capacity 1 to every edge. Clearly, the min-cost flow in this graph with value k corresponds to a set of k edge-disjoint paths, such that their total length is minimized.
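To make this reduction concrete, the following sketch computes k shortest edge-disjoint paths via successive shortest augmenting paths; this is a generic min-cost flow computation specialized to unit capacities and unit costs (Bellman-Ford is used because residual costs can be −1). It is meant for small graphs, not as a tuned implementation:

```python
def shortest_edge_disjoint_paths(n, edges, s, t, k):
    """Find a min-total-length set of k edge-disjoint s-t paths, encoded as a
    0/1 flow: every edge gets capacity 1 and cost 1, and k units of flow are
    pushed via successive shortest augmenting paths."""
    flow = {e: 0 for e in edges}                # 0/1 flow on each directed edge
    for _ in range(k):
        # Residual arcs: unused edges cost +1; used edges reversed at cost -1.
        res = [(u, v, 1, (u, v)) for (u, v) in edges if flow[(u, v)] == 0]
        res += [(v, u, -1, (u, v)) for (u, v) in edges if flow[(u, v)] == 1]
        INF = float('inf')
        dist = {v: INF for v in range(n)}
        prev = {}
        dist[s] = 0
        for _ in range(n - 1):                  # Bellman-Ford
            for u, v, w, e in res:
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
                    prev[v] = (u, e)
        if dist[t] == INF:
            raise ValueError("fewer than k edge-disjoint paths exist")
        v = t
        while v != s:                           # augment along the shortest path
            u, e = prev[v]
            flow[e] ^= 1                        # use the edge, or cancel a used one
            v = u
    return flow
```

The support of the returned flow decomposes into the k desired paths (by the flow decomposition of Chapter 17).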

18.4. Covering by Cycles


Given a directed graph G, we would like to cover all its vertices by a set of cycles which are vertex disjoint. This can be done, again, using min-cost flow. Indeed, replace every vertex u in G by an edge (u′, u′′), where all the incoming edges of u are now connected to u′ and all the outgoing edges of u now start from u′′. Let J denote the resulting graph. All the new edges in the graph have a lower bound and capacity of 1, and all the other edges have no lower bound, but their capacity is 1. We compute the minimum cost circulation in J. Clearly, this corresponds to a collection of cycles in G covering all the vertices with minimum total cost.

Theorem 18.4.1. Given a directed graph G and costs on the edges, one can compute a cover of G by a collection
of vertex disjoint cycles, such that the total cost of the cycles is minimized.
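After the vertex splitting, a minimum-cost circulation in J is exactly a minimum-cost perfect matching between out-copies and in-copies; equivalently, a permutation π with every edge (u, π(u)) present in G. The following brute force over permutations makes this concrete on tiny graphs; it replaces the polynomial-time min-cost circulation computation of the text and is exponential:

```python
from itertools import permutations

def min_cost_cycle_cover(n, cost):
    """Minimum-cost cover of vertices 0..n-1 by vertex-disjoint directed
    cycles, via the permutation view of the vertex-splitting construction.
    `cost` maps existing edges (u, v) to their cost.  Exponential brute
    force, for tiny graphs only."""
    best, best_pi = None, None
    for pi in permutations(range(n)):
        if all((u, pi[u]) in cost for u in range(n)):
            c = sum(cost[(u, pi[u])] for u in range(n))
            if best is None or c < best:
                best, best_pi = c, pi
    return best, best_pi
```

The permutation's cycle structure is exactly the cycle cover: following u → π(u) repeatedly traces out the vertex-disjoint cycles.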

18.5. Minimum weight bipartite matching

Given an undirected bipartite graph G, we would like to find the maximum cardinality matching in G that has minimum cost. The idea is to reduce this to network flow as we did in the unweighted case, and compute the maximum flow; in the constructed graph (depicted on the right), every edge has capacity 1. This gives us the size φ of the maximum matching in G. Next, we compute the min-cost flow in this network with value φ, where the edges connected to the source or the sink have cost zero, and the other edges are assigned their original cost in G. Clearly, the min-cost flow in this graph corresponds to a maximum cardinality minimum-cost matching in the original graph.
Here, we are using the fact that the flow computed is integral, and as such, it is a 0/1-flow.

Theorem 18.5.1. Given a bipartite graph G and costs on the edges, one can compute the maximum cardinality
minimum cost matching in polynomial time.
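For tiny instances, the object that the min-cost flow computes can be specified directly by exhaustive search: maximize cardinality first, and break ties by cost. This brute force is only a specification-level check, not the polynomial-time algorithm of the theorem:

```python
def max_card_min_cost_matching(L, R, w):
    """Maximum-cardinality, minimum-cost matching in a bipartite graph, by
    exhaustive search over partial injective assignments.  `w` maps edges
    (l, r) to their cost.  Exponential; for sanity checks on tiny graphs."""
    def go(i, used):
        if i == len(L):
            return (0, 0)                       # (-cardinality, cost) of empty rest
        best = go(i + 1, used)                  # option: leave L[i] unmatched
        for r in R:
            if r not in used and (L[i], r) in w:
                card, cost = go(i + 1, used | {r})
                cand = (card - 1, cost + w[(L[i], r)])
                if cand < best:                 # more edges first, then cheaper
                    best = cand
        return best
    card, cost = go(0, frozenset())
    return -card, cost
```

Negating the cardinality lets lexicographic tuple comparison implement "maximum cardinality, then minimum cost" in one pass.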

18.6. The transportation problem
In the transportation problem, we are given m facilities f1, . . . , fm. The facility fi contains xi units of some commodity, for i = 1, . . . , m. Similarly, there are n customers u1, . . . , un that would like to buy this commodity. In particular, uj would like to buy dj units, for j = 1, . . . , n. To make things interesting, it costs cij to send one unit of commodity from facility i to customer j. The natural question is how to supply the demands while minimizing the total cost.
To this end, we create a bipartite graph with f1, . . . , fm on one side and u1, . . . , un on the other side. There is an edge from fi to uj with cost cij, for i = 1, . . . , m and j = 1, . . . , n. Next, we create a source vertex s that is connected to fi with capacity xi, for i = 1, . . . , m. Similarly, we create an edge from uj to the sink t with capacity dj, for j = 1, . . . , n. We compute the min-cost flow in this network that pushes φ = Σj dj units from the source to the sink. Clearly, the solution encodes the required optimal solution to the transportation problem.

Theorem 18.6.1. The transportation problem can be solved in polynomial time.
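The network construction can be sketched as follows; the node names and the unbounded capacity on the facility-to-customer edges are choices made here for illustration (any capacity of at least min(x_i, d_j) on those edges would do):

```python
def transportation_network(supply, demand, c):
    """Build the min-cost flow network for the transportation problem, as in
    the text: source -> facility i with capacity x_i, facility i -> customer j
    with cost c[i][j], customer j -> sink with capacity d_j.  Edges are
    returned as (tail, head, capacity, cost) tuples; one then pushes
    phi = sum(demand) units of min-cost flow from 's' to 't'."""
    INF = float('inf')
    edges = [('s', ('f', i), x, 0) for i, x in enumerate(supply)]
    edges += [(('f', i), ('u', j), INF, c[i][j])
              for i in range(len(supply)) for j in range(len(demand))]
    edges += [(('u', j), 't', d, 0) for j, d in enumerate(demand)]
    return edges, sum(demand)
```

By the integrality of min-cost flow, if all supplies and demands are integers, the optimal shipment plan is integral as well.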

Part VI
Linear Programming

Chapter 19

Linear Programming in Low Dimensions

At the sight of the still intact city, he remembered his great international precursors and set the whole place on fire with his artillery in order that those who came after him might work off their excess energies in rebuilding.

The Tin Drum, Günter Grass

In this chapter, we briefly describe (and analyze) a simple randomized algorithm for linear programming in low dimensions. Next, we show how to extend this algorithm to solve linear programming with violations. Finally, we show how one can efficiently approximate the number of constraints that one needs to violate to make a linear program feasible. This serves as fruitful ground to demonstrate some of the techniques we have visited already. Our discussion is going to be somewhat intuitive – it can be made more formal with more work.

19.1. Some geometry first


We first prove Radon’s and Helly’s theorems.

Definition 19.1.1. The convex hull of a set P ⊆ R^d is the set of all convex combinations of points of P; that is,
$$\mathcal{CH}(P) = \Bigl\{ \sum_{i=1}^{m} \alpha_i s_i \ \Bigm|\ m \ge 1,\ \forall i\ s_i \in P,\ \alpha_i \ge 0, \text{ and } \sum_{i=1}^{m} \alpha_i = 1 \Bigr\}.$$

Claim 19.1.2. Let P = {p1, . . . , pd+2} be a set of d + 2 points in R^d. There are real numbers β1, . . . , βd+2, not all of them zero, such that Σ_i β_i p_i = 0 and Σ_i β_i = 0.

Proof: Indeed, set q_i = (p_i, 1), for i = 1, . . . , d + 2. Now, the points q1, . . . , qd+2 ∈ R^{d+1} are linearly dependent, and there are coefficients β1, . . . , βd+2, not all of them zero, such that Σ_{i=1}^{d+2} β_i q_i = 0. Considering only the first d coordinates of these points implies that Σ_{i=1}^{d+2} β_i p_i = 0. Similarly, considering only the (d + 1)st coordinate of these points implies that Σ_{i=1}^{d+2} β_i = 0.

Theorem 19.1.3 (Radon’s theorem). Let P = {p1, . . . , pd+2} be a set of d + 2 points in R^d. Then, there exist two disjoint subsets C and D of P, such that CH(C) ∩ CH(D) ≠ ∅ and C ∪ D = P.

Proof: By Claim 19.1.2 there are real numbers β1, . . . , βd+2, not all of them zero, such that Σ_i β_i p_i = 0 and Σ_i β_i = 0.
Assume, for the sake of simplicity of exposition, that β1, . . . , βk ≥ 0 and βk+1, . . . , βd+2 < 0. Furthermore, let µ = Σ_{i=1}^{k} β_i = −Σ_{i=k+1}^{d+2} β_i. We have that
$$\sum_{i=1}^{k} \beta_i p_i = -\sum_{i=k+1}^{d+2} \beta_i p_i.$$
In particular, v = Σ_{i=1}^{k} (β_i/µ) p_i is a point in CH({p1, . . . , pk}). Furthermore, for the same point v we have v = Σ_{i=k+1}^{d+2} −(β_i/µ) p_i ∈ CH({pk+1, . . . , pd+2}). We conclude that v is in the intersection of the two convex hulls, as required.
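The proof is constructive: a nonzero kernel vector of the (d + 1) × (d + 2) homogeneous system built from the lifted points (p_i, 1) yields both the partition and the common point. A sketch with exact rational arithmetic (the input points are assumed here to have rational coordinates):

```python
from fractions import Fraction

def radon_partition(points):
    """Split d+2 points in R^d into two index sets with intersecting convex
    hulls, following Claim 19.1.2 / Theorem 19.1.3.  Returns (pos, neg, v),
    where v lies in the convex hulls of both parts."""
    d = len(points[0])
    m = len(points)                             # m = d + 2
    # Rows: one per coordinate, plus the all-ones row; columns: one per point.
    A = [[Fraction(p[r]) for p in points] for r in range(d)]
    A.append([Fraction(1)] * m)
    # Gaussian elimination to reduced row echelon form.
    pivots, row = [], 0
    for col in range(m):
        piv = next((r for r in range(row, d + 1) if A[r][col] != 0), None)
        if piv is None:
            continue
        A[row], A[piv] = A[piv], A[row]
        A[row] = [x / A[row][col] for x in A[row]]
        for r in range(d + 1):
            if r != row and A[r][col] != 0:
                A[r] = [a - A[r][col] * b for a, b in zip(A[r], A[row])]
        pivots.append(col)
        row += 1
    # A free column exists since the system has d+1 rows and d+2 columns.
    free = next(c for c in range(m) if c not in pivots)
    beta = [Fraction(0)] * m
    beta[free] = Fraction(1)
    for r, col in enumerate(pivots):
        beta[col] = -A[r][free]                 # back-substitution
    pos = [i for i in range(m) if beta[i] > 0]
    neg = [i for i in range(m) if beta[i] < 0]
    mu = sum(beta[i] for i in pos)
    v = tuple(sum(beta[i] / mu * Fraction(points[i][k]) for i in pos)
              for k in range(d))
    return pos, neg, v
```

For instance, for the four planar points (0,0), (4,0), (0,4), (1,1), the point (1,1) is the Radon point: it lies in the triangle spanned by the other three.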

Theorem 19.1.4 (Helly’s theorem). Let F be a set of n convex sets in R^d. The intersection of all the sets of F is non-empty if and only if every d + 1 of them have a non-empty intersection.

Proof: This theorem is the “dual” to Radon’s theorem.
If the intersection of all sets in F is non-empty, then any intersection of d + 1 of them is non-empty. As for the other direction, assume for the sake of contradiction that F is the minimal set of convex sets for which the claim fails. Namely, for m = |F| > d + 1, any subset of m − 1 sets of F has a non-empty intersection, and yet the intersection of all the sets of F is empty.
As such, for X ∈ F, let p_X be a point in the intersection of all sets of F excluding X. Let P = { p_X | X ∈ F }. Here |P| = |F| > d + 1. By Radon’s theorem, there is a partition of P into two disjoint sets R and Q such that CH(R) ∩ CH(Q) ≠ ∅. Let s be any point inside this non-empty intersection.
Let U(R) = {X | p_X ∈ R} and U(Q) = {X | p_X ∈ Q} be the two subsets of F corresponding to R and Q, respectively. By definition, for X ∈ U(R), we have that
$$p_X \in \bigcap_{Y \in F,\; Y \ne X} Y \subseteq \bigcap_{Y \in F \setminus U(R)} Y = \bigcap_{Y \in U(Q)} Y,$$
since U(Q) ∪ U(R) = F and U(Q) ∩ U(R) = ∅. As such, R ⊆ ∩_{Y ∈ U(Q)} Y and Q ⊆ ∩_{Y ∈ U(R)} Y. Now, by the convexity of the sets of F, we have CH(R) ⊆ ∩_{Y ∈ U(Q)} Y and CH(Q) ⊆ ∩_{Y ∈ U(R)} Y. Namely, we have
$$s \in \mathcal{CH}(R) \cap \mathcal{CH}(Q) \subseteq \Bigl(\bigcap_{Y \in U(Q)} Y\Bigr) \cap \Bigl(\bigcap_{Y \in U(R)} Y\Bigr) = \bigcap_{Y \in F} Y.$$
Namely, the intersection of all the sets of F is not empty, a contradiction.

19.2. Linear programming


Assume we are given a set of n linear inequalities of the form a1 x1 + · · · + ad xd ≤ b, where a1, . . . , ad , b are
constants and x1, . . . , xd are the variables. In the linear programming (LP) problem, one has to find a
feasible solution, that is, a point (x1, . . . , xd ) for which all the linear inequalities hold. In the following, we
use the shorthand LPI to stand for linear programming instance. Usually we would like to find a feasible
point that maximizes a linear expression (referred to as the target function of the given LPI) of the form
c1 x1 + · · · + cd xd , where c1, . . . , cd are prespecified constants.
The set of points complying with a linear inequality a1x1 + · · · + adxd ≤ b is a halfspace of R^d having the hyperplane a1x1 + · · · + adxd = b as a boundary; for example, in the plane, the inequality 3y + 2x ≤ 6 defines the halfplane bounded by the line 3y + 2x = 6. As such, the feasible region of an LPI is the intersection of n halfspaces; that is, it is a polyhedron. If the polyhedron is bounded, then it is a polytope. The linear target function is no more than specifying a direction, such that we need to find the point inside the polyhedron which is extreme in this direction. If the polyhedron is unbounded in this direction, the optimal solution is unbounded.

For the sake of simplicity of exposition, it will be easier to think of the direction for which one has to optimize as the negative xd-axis direction. This can be easily realized by rotating the space such that the required direction is pointing downward. Since the feasible region is the intersection of convex sets (i.e., halfspaces), it is convex. As such, one can imagine the boundary of the feasible region as a vessel (with a convex interior). Next, we release a ball at the top of the vessel, and the ball rolls down (by “gravity” in the direction of the negative xd-axis) till it reaches the lowest point in the vessel and gets “stuck”. This point is the optimal solution to the LPI that we are interested in computing.
In the following, we will assume that the given LPI is in general position. Namely, if we intersect k hyperplanes, induced by k inequalities in the given LPI (the hyperplanes are the result of taking each of these inequalities as an equality), then their intersection is a (d − k)-dimensional affine subspace. In particular, the intersection of d of them is a point (referred to as a vertex). Similarly, the intersection of any d + 1 of them is
empty.
A polyhedron defined by an LPI with n constraints might have O(n^⌊d/2⌋) vertices on its boundary (this is known as the upper-bound theorem [Grü03]). As we argue below, the optimal solution is a vertex. As such,
a naive algorithm would enumerate all relevant vertices (this is a non-trivial undertaking) and return the best
possible vertex. Surprisingly, in low dimension, one can do much better and get an algorithm with linear running
time.
We are interested in the best vertex of the feasible region, while this polyhedron is defined implicitly as the intersection of halfspaces, and this hints at the quandary that we are in: We are looking for an optimal vertex in a large graph that is defined implicitly. Intuitively, this is why proving the correctness of the algorithms we present here is a non-trivial undertaking (as already mentioned, we will prove correctness in the next chapter).

19.2.1. A solution and how to verify it


Observe that an optimal solution of an LPI is either a vertex or unbounded. Indeed, if the optimal solution
p lies in the middle of a segment s, such that s is feasible, then either one of its endpoints provides a better
solution (i.e., one of them is lower in the xd -direction than p) or both endpoints of s have the same target
value. But then, we can move the solution to one of the endpoints of s. In particular, if the solution lies on a
k-dimensional facet F of the boundary of the feasible polyhedron (i.e., formally F is a set with affine dimension
k formed by the intersection of the boundary of the polyhedron with a hyperplane), we can move it so that it
lies on a (k − 1)-dimensional facet F′ of the feasible polyhedron, using the preceding argument. Applying it repeatedly, one ends up at a vertex of the polyhedron or with an unbounded solution.
Thus, given an instance of LPI, the LP solver should output one of the following answers.

(A) Finite. The optimal solution is finite, and the solver would provide a vertex which realizes the optimal
solution.

(B) Unbounded. The given LPI has an unbounded solution. In this case, the LP solver would output a ray ζ, such that ζ lies inside the feasible region and points in the negative xd-axis direction.

(C) Infeasible. The given LPI does not have any point which complies with all the given inequalities. In this
case the solver would output d + 1 constraints which are infeasible on their own.

Lemma 19.2.1. Given a set of d linear inequalities in R^d, one can compute the vertex induced by the intersection of their boundaries in O(d³) time.

Proof: Write down the system of equalities that the vertex must fulfill. It is a system of d equalities in d variables, and it can be solved in O(d³) time using Gaussian elimination.
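A minimal sketch of this computation, for constraints given as (a, b) pairs meaning a · x = b; this is plain Gaussian elimination with partial pivoting, and it assumes the system is non-degenerate (general position):

```python
def vertex_of(equalities):
    """Solve the d x d linear system whose solution is the vertex where the d
    constraint hyperplanes a . x = b meet (Lemma 19.2.1).  `equalities` is a
    list of (a, b) pairs with a a length-d coefficient vector."""
    d = len(equalities)
    M = [list(a) + [b] for a, b in equalities]      # augmented matrix
    for col in range(d):
        piv = max(range(col, d), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]             # partial pivoting
        for r in range(d):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][d] / M[i][i] for i in range(d)]
```

For example, the two lines x + y = 3 and x − y = 1 meet at the vertex (2, 1).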


A cone is the intersection of d constraints, where its apex is the vertex associated with this set of constraints. A set of such d constraints is a basis. An intersection of d − 1 of the hyperplanes of a basis forms a line, and intersecting this line with the cone of the basis forms a ray. Clipping the same line to the feasible region would yield either a segment, referred to as an edge of the polyhedron, or a ray (if the feasible region is an unbounded polyhedron). An edge of the polyhedron connects two vertices of the polyhedron.
As such, one can think about the boundary of the feasible region as inducing a graph – its vertices and edges are the vertices and edges of the polyhedron, respectively. Since every vertex has d hyperplanes defining it (its basis) and an adjacent edge is defined by d − 1 of these hyperplanes, it follows that each vertex has $\binom{d}{d-1} = d$ edges adjacent to it.
of a linear program looks like, and we delegate it to the next chapter.
Lemma 19.2.2. Let L be a given LPI, and let P denote its feasible region. Let v be a vertex of P, such that
all the d rays emanating from v are in the upward xd -axis direction (i.e., the direction vectors of all these d
rays have positive xd -coordinate). Then v is the lowest (in the xd -axis direction) point in P and it is thus the
optimal solution to L.

Interestingly, when we are at a vertex v of the feasible region, it is easy to find the adjacent vertices. Indeed, compute the d rays emanating from v. For each such ray, intersect it with all the constraints of the LPI. The closest intersection point along the ray is the vertex u of the feasible region adjacent to v. Doing this naively takes O(dn + d^{O(1)}) time.
Lemma 19.2.2 offers a simple algorithm for computing the optimal solution for an LPI. Start from a feasible
vertex of the LPI. As long as this vertex has at least one ray that points downward, follow this ray to an adjacent
vertex on the feasible polytope that is lower than the current vertex (i.e., compute the d rays emanating from
the current vertex, and follow one of the rays that points downward, till you hit a new vertex). Repeat this till
the current vertex has all rays pointing upward, by Lemma 19.2.2 this is the optimal solution. Up to tedious
(and non-trivial) details this is the simplex algorithm.
We need the following lemma, whose proof is also delegated to the next chapter.
Lemma 19.2.3. If L is an LPI in d dimensions which is not feasible, then there exist d + 1 inequalities in L
which are infeasible on their own.

Note that given a set of d + 1 inequalities, it is easy to verify (in time polynomial in d) if they are feasible or not. Indeed, compute the $\binom{d+1}{d} = d + 1$ vertices formed by this set of constraints, and check whether any of these vertices is feasible (for these d + 1 constraints). If all of them are infeasible, then this set of constraints is infeasible.

19.3. Low-dimensional linear programming


19.3.1. An algorithm for a restricted case
There are a lot of tedious details that one has to take care of to make things work with linear programming.
As such, we will first describe the algorithm for a special case and then provide the envelope required so that
one can use it to solve the general case.
A vertex v is acceptable if all the d rays associated with it point upward (note that the vertex might not
be feasible). The optimal solution (if it is finite) must be located at an acceptable vertex.

Input for the restricted case. The input for the restricted case is an LPI L, which is defined by a set of n
linear inequalities in Rd , and a basis B = {h1, . . . , hd } of an acceptable vertex.

Let hd+1, . . . , hn be a random permutation of the remaining constraints of the LPI L.
We are looking for the lowest point in Rd which is feasible for L. Our algorithm is randomized incremental.
At the ith step, for i > d, it will maintain the optimal solution for the first i constraints. As such, in the ith
step, the algorithm checks whether the optimal solution vi−1 of the previous iteration is still feasible with the
new constraint hi (namely, the algorithm checks if vi−1 is inside the halfspace defined by hi ). If vi−1 is still
feasible, then it is still the optimal solution, and we set vi ← vi−1 .
The more interesting case is when vi−1 ∉ hi. First, we check if the basis of vi−1 together with hi forms a set of constraints which is infeasible. If so, the given LPI is infeasible, and we output B(vi−1) ∪ {hi} as the proof of infeasibility.
Otherwise, the new optimal solution must lie on the hyperplane associated with hi. As such, we recursively compute the lowest vertex in the (d − 1)-dimensional polyhedron (∂hi) ∩ h1 ∩ · · · ∩ hi−1, where ∂hi denotes the hyperplane which is the boundary of the halfspace hi. This is a linear program involving i − 1 constraints, and it involves d − 1 variables, since the LPI lies on the (d − 1)-dimensional hyperplane ∂hi. The solution found, vi, is defined by a basis of d − 1 constraints in the (d − 1)-dimensional subspace ∂hi, and adding hi to it results in an acceptable vertex that is feasible in the original d-dimensional space. We continue to the next iteration.
Clearly, the vertex vn is the required optimal solution.

19.3.1.1. Running time analysis


Every set of d constraints is feasible, and computing the vertex formed by such a set of constraints takes O(d³) time, by
Lemma 19.2.1.
Let Xi be an indicator variable that is 1 if and only if the vertex vi is recomputed in the ith iteration (by
performing a recursive call). This happens only if hi is one of the d constraints in the basis of vi . Since there
are at most d constraints that define the basis, and there are at least i − d constraints that are being randomly
ordered (as the first d slots are fixed), we have that the probability that vi ≠ vi−1 is

αi = P[Xi = 1] ≤ min( d/(i − d), 1 ) ≤ 2d/i,

for i ≥ d + 1, as can be easily verified.¬ So, let T(m, d) be the expected time to solve an LPI with m constraints
in d dimensions. We have that T(d, d) = O(d³) by the above. Now, in every iteration, we need to check if the
current solution lies inside the new constraint, which takes O(d) time per iteration and O(dm) time overall.
Now, if Xi = 1, then we need to update each of the i − 1 constraints to lie on the hyperplane hi . The
hyperplane hi defines a linear equality, which we can use to eliminate one of the variables. This takes O(di)
time, and we have to do the recursive call. The probability that this happens is αi . As such, we have
T(m, d) = E[ O(md) + Σ_{i=d+1}^{m} Xi (di + T(i − 1, d − 1)) ]
        = O(md) + Σ_{i=d+1}^{m} αi (di + T(i − 1, d − 1))
        = O(md) + Σ_{i=d+1}^{m} (2d/i) (di + T(i − 1, d − 1))
        = O(md²) + Σ_{i=d+1}^{m} (2d/i) T(i − 1, d − 1).

¬ Indeed, d/((i − d) + d) = d/i is at least half of the smaller of d/(i − d) and d/d = 1, and thus min( d/(i − d), 1 ) ≤ 2d/i.

Figure 19.1: Demonstrating the algorithm for the general case: (a) given constraints and feasible region, (b)
constraints moved to pass through the origin (their intersection on the hyperplane h is empty), and (c) the
resulting acceptable vertex v.

Guessing that T(m, d) ≤ cd m, we have that

T(m, d) ≤ ĉ₁ md² + Σ_{i=d+1}^{m} (2d/i) c_{d−1} (i − 1) ≤ ĉ₁ md² + Σ_{i=d+1}^{m} 2d c_{d−1} ≤ (ĉ₁ d² + 2d c_{d−1}) m,

where ĉ₁ is some absolute constant. We need that ĉ₁ d² + 2c_{d−1} d ≤ cd , which holds for cd = O((3d)^d), and then
T(m, d) = O((3d)^d m).

Lemma 19.3.1. Given an LPI with n constraints in d dimensions and an acceptable vertex for this LPI, one
can compute the optimal solution in expected O((3d)^d n) time.

19.3.2. The algorithm for the general case


Let L be the given LPI, and let L′ be the instance formed by translating all the constraints so that they pass
through the origin. Next, let h be the hyperplane xd = −1. Consider a solution to the LP L′ when restricted to
h. This is a (d − 1)-dimensional instance of linear programming, and it can be solved recursively.
If the recursive call on L′ ∩ h returned no solution, then the d constraints that prove that the LP L′ is
infeasible on h correspond to a basis in L of a vertex v which is acceptable in the original LPI. Indeed, as
we move these d constraints to the origin, their intersection on h is empty (i.e., the “quadrant” that their
intersection forms is unbounded only in the upward direction). As such, we can now apply the algorithm of
Lemma 19.3.1 to solve the given LPI. See Figure 19.1.
If there is a solution to L′ ∩ h, then it is a vertex v on h which is feasible. Thus, consider the original set
of d − 1 constraints in L that correspond to the basis B of v. Let ℓ be the line formed by the intersection of
the hyperplanes of B. It is now easy to verify that the intersection of the feasible region with this line is an
unbounded ray, and the algorithm returns this unbounded (downward-oriented) ray as a proof that the LPI is
unbounded.

Theorem 19.3.2. Given an LP instance with n constraints defined over d variables, it can be solved in expected
O((3d)^d n) time.

Proof: The expected running time is

S(n, d) = O(nd) + S(n, d − 1) + T(n, d),

where T(n, d) is the expected time to solve an LPI with n constraints in the restricted case of Section 19.3.1.
Indeed, we first solve the problem on the (d − 1)-dimensional subspace h ≡ (xd = −1). This takes O(dn) + S(n, d − 1)
time (we need to rewrite the constraints for the lower-dimensional instance, and that takes O(dn) time). If the
solution on h is feasible, then the original LPI has an unbounded solution, and we return it. Otherwise, we
obtained an acceptable vertex, and we can use the special case algorithm on the original LPI. Now, the solution
to this recurrence is O((3d)^d n); see Lemma 19.3.1.

Chapter 20

Linear Programming

20.1. Introduction and Motivation


In the VCR/guns/nuclear-bombs/napkins/star-wars/professors/butter/mice problem, the benevolent dictator,
Biga Piguinus, of Penguina (a country in south Antarctica having 24 million penguins under its control) has
to decide how to allocate her empire resources to the maximal benefit of her penguins. In particular, she has
to decide how to allocate the money for the next year budget. For example, buying a nuclear bomb has a
tremendous positive effect on security (the ability to destruct yourself completely together with your enemy
induces a peaceful serenity feeling in most people). Guns, on the other hand, have a weaker effect. Penguina
(the state) has to supply a certain level of security. Thus, the allocation should be such that:

xgun + 1000 ∗ xnuclear−bomb ≥ 1000,

where xgun is the number of guns constructed and xnuclear−bomb is the number of nuclear bombs constructed.
On the other hand,

100 ∗ xgun + 1000000 ∗ xnuclear−bomb ≤ xsecurity ,

where xsecurity is the total amount Penguina is willing to spend on security, 100 is the price of producing a single
gun, and 1,000,000 is the price of manufacturing one nuclear bomb. There are a lot of other constraints of this
type, and Biga Piguinus would like to satisfy them, while minimizing the total money allocated for such spending
(the less spent on the budget, the larger the tax cut).
More formally, we have a (potentially large) number of variables x1, . . . , xn and a (potentially large) system
of linear inequalities. We will refer to such an inequality as a constraint. We would like to decide if there is an
assignment of values to x1, . . . , xn where all these inequalities are satisfied. Since there might be an infinite
number of such solutions, we want the solution that maximizes some linear quantity. See the following instance:

a11 x1 + . . . + a1n xn ≤ b1
a21 x1 + . . . + a2n xn ≤ b2
...
am1 x1 + . . . + amn xn ≤ bm
max c1 x1 + . . . + cn xn .
The linear target function we are trying to maximize is known as the objective function of the linear
program. Such a problem is an instance of linear programming. We refer to linear programming as LP.

20.1.1. History
Linear programming can be traced back to the early 19th century. It started in earnest in 1939, when L. V.
Kantorovich noticed the importance of certain types of linear programming problems. Unfortunately, for several

∀(u, v) ∈ E:  0 ≤ xu→v ,
              xu→v ≤ c(u → v)
∀v ∈ V \ {s, t}:  Σ_{(u,v)∈E} xu→v − Σ_{(v,w)∈E} xv→w ≤ 0,
                  Σ_{(u,v)∈E} xu→v − Σ_{(v,w)∈E} xv→w ≥ 0
maximizing Σ_{(s,u)∈E} xs→u

Figure 20.1

max Σ_{j=1}^{n} cj xj
subject to Σ_{j=1}^{n} aij xj ≤ bi for i = 1, 2, . . . , m.

Figure 20.2

years, Kantorovich's work was unknown in the west and unnoticed in the east.
Dantzig, in 1947, invented the simplex method for solving LP problems for the US Air Force planning
problems.
T. C. Koopmans, in 1947, showed that LP provides the right model for the analysis of classical economic
theories.
In 1975, both Koopmans and Kantorovich received the Nobel Prize in economics. Dantzig probably did not get it
because his work was too mathematical. That is how it goes. Kantorovich was the only Russian economist
who received the Nobel Prize¬ .

20.1.2. Network flow via linear programming


To see the impressive expressive power of linear programming, we next show that network flow can be solved
using linear programming. Thus, we are given an instance of max flow; namely, a network flow G = (V, E) with
source s and sink t, and capacities c(·) on the edges. We would like to compute the maximum flow in G.
To this end, for an edge (u, v) ∈ E, let xu→v be a variable which is the amount of flow assigned to (u, v) in
the maximum flow. We demand that 0 ≤ xu→v and xu→v ≤ c(u → v) (flow is non-negative on the edges, and it
complies with the capacity constraints). Next, for any vertex v which is not the source or the sink, we require
that Σ_{(u,v)∈E} xu→v = Σ_{(v,w)∈E} xv→w (this is conservation of flow). Note that an equality constraint a = b can
be rewritten as two inequality constraints a ≤ b and b ≤ a. Finally, under all these constraints, we are interested
in the maximum flow. Namely, we would like to maximize the quantity Σ_{(s,u)∈E} xs→u . Clearly, putting all these
constraints together, we get the linear program depicted in Figure 20.1.
It is not too hard to write down min-cost network flow using linear programming.
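The constraints above are straightforward to materialize in code. The sketch below (the graph, names, and capacities are made up for illustration) builds the inequality rows for a toy network and checks that a candidate maximum flow satisfies every one of them:

```python
def flow_lp_rows(V, E, cap, s, t):
    """Build the max-flow LP as rows (coeffs, bound), each meaning
    sum_e coeffs[e] * x_e <= bound, with one variable per edge."""
    idx = {e: i for i, e in enumerate(E)}
    rows = []
    for e in E:
        row = [0] * len(E); row[idx[e]] = -1
        rows.append((row, 0))            # 0 <= x_e
        row = [0] * len(E); row[idx[e]] = 1
        rows.append((row, cap[e]))       # x_e <= c(e)
    for v in V:                          # conservation, as two inequalities
        if v in (s, t):
            continue
        row = [0] * len(E)
        for (u, w) in E:
            if w == v: row[idx[(u, w)]] += 1   # flow into v
            if u == v: row[idx[(u, w)]] -= 1   # flow out of v
        rows.append((row, 0))
        rows.append(([-a for a in row], 0))
    return rows

V, E = ['s', 'a', 't'], [('s', 'a'), ('a', 't')]
cap = {('s', 'a'): 3, ('a', 't'): 2}
rows = flow_lp_rows(V, E, cap, 's', 't')
flow = [2, 2]                            # pushes 2 units along s -> a -> t
assert all(sum(c * x for c, x in zip(row, flow)) <= b for row, b in rows)
```

Feeding these rows (and the objective Σ_{(s,u)∈E} xs→u) to any LP solver would then recover a maximum flow.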

20.2. The Simplex Algorithm
20.2.1. Linear program where all the variables are positive
We are given a LP, depicted in Figure 20.2, where a variable can have any real value. As a first step to solving it,
we would like to rewrite it such that every variable is non-negative. This is easy to do, by replacing a variable
xi by two new variables xi′ and xi″, where xi = xi′ − xi″, xi′ ≥ 0 and xi″ ≥ 0. For example, the (trivial) linear
program containing the single constraint 2x + y ≥ 5 would be replaced by the following LP: 2x′ − 2x″ + y′ − y″ ≥ 5,
x′ ≥ 0, y′ ≥ 0, x″ ≥ 0 and y″ ≥ 0.
Lemma 20.2.1. Given an instance I of LP, one can rewrite it into an equivalent LP, such that all the variables
must be non-negative. This takes linear time in the size of I.
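A minimal sketch of this rewriting (the list-of-rows encoding below is made up for illustration): each free variable xi is split into xi′ − xi″, doubling the coefficient vector of every constraint:

```python
def split_free_vars(rows):
    """Rewrite constraints sum_j a_j x_j <= b over free variables into
    constraints over 2n non-negative variables via x_j = x'_j - x''_j."""
    return [([v for a in coeffs for v in (a, -a)], b) for coeffs, b in rows]

def recover(xsplit):
    """Map a solution over (x'_1, x''_1, x'_2, x''_2, ...) back."""
    return [xsplit[2 * j] - xsplit[2 * j + 1] for j in range(len(xsplit) // 2)]

# 2x + y >= 5, written as -2x - y <= -5, becomes
# -2x' + 2x'' - y' + y'' <= -5 over non-negative variables.
rows = split_free_vars([([-2, -1], -5)])
assert rows == [([-2, 2, -1, 1], -5)]
# x = 3, y = -1 corresponds to x' = 3, x'' = 0, y' = 0, y'' = 1.
assert recover([3, 0, 0, 1]) == [3, -1]
```

The transformation clearly takes linear time in the size of the instance, as the lemma states.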

20.2.2. Standard form


Using Lemma 20.2.1, we can now require a LP to be specified using only non-negative variables. This is known as
standard form.
A linear program in standard form:

max Σ_{j=1}^{n} cj xj
subject to Σ_{j=1}^{n} aij xj ≤ bi for i = 1, 2, . . . , m
xj ≥ 0 for j = 1, . . . , n.

The same linear program in matrix notation:

max c^T x
subject to Ax ≤ b,
x ≥ 0.

Here the matrix notation arises by setting

c = (c1, . . . , cn )^T , b = (b1, . . . , bm )^T , x = (x1, . . . , xn )^T ,

and letting A be the m × n matrix whose entry in row i and column j is aij .
Note that c, b and A are prespecified, and x is the vector of unknowns that we have to solve the LP for.
In the following, in order to solve the LP, we are going to do a long sequence of rewritings till we reach the
optimal solution.

20.2.3. Slack Form


We next rewrite the LP into slack form. It is a more convenient­ form for describing the Simplex algorithm
for solving LP.
Specifically, one can rewrite a LP so that every inequality becomes an equality, and all the variables must be
non-negative; namely, the new LP will have the following form (in matrix notation):

max c^T x
subject to Ax = b,
x ≥ 0.

To this end, we introduce new variables (slack variables), rewriting the inequality

Σ_{i=1}^{n} ai xi ≤ b

¬ There were other economists who were born in Russia but lived in the west and received the Nobel Prize – Leonid Hurwicz, for
example.
­ The word convenience is used here in the most liberal interpretation possible.

as

xn+1 = b − Σ_{i=1}^{n} ai xi ,
xn+1 ≥ 0.

Intuitively, the value of the slack variable xn+1 encodes how far the original inequality is from holding with
equality.
Now, we have a special variable for each inequality in the LP (this is xn+1 in the above example). These
variables are special, and will be called basic variables. All the other variables on the right side are
nonbasic variables (original, isn’t it?). A LP in this form is in slack form.
The slack form is defined by a tuple (N, B, A, b, c, v):

Linear program in slack form:
max z = v + Σ_{j∈N} cj xj ,
s.t. xi = bi − Σ_{j∈N} aij xj for i ∈ B,
xi ≥ 0, ∀i = 1, . . . , n + m.

B – set of indices of basic variables
N – set of indices of nonbasic variables
n = |N| – number of original variables
m = |B| – number of basic variables (i.e., number of inequalities)
A = (aij) – the matrix of coefficients
b, c – two vectors of constants
N ∪ B = {1, . . . , n + m}
v – objective function constant.

Exercise 20.2.2. Show that any linear program can be transformed into equivalent slack form.

Example 20.2.3. Consider the following LP, which is in slack form, and its translation into the tuple (N, B, A, b, c, v):

max z = 29 − (1/9)x3 − (1/9)x5 − (2/9)x6
x1 = 8 + (1/6)x3 + (1/6)x5 − (1/3)x6
x2 = 4 − (8/3)x3 − (2/3)x5 + (1/3)x6
x4 = 18 − (1/2)x3 + (1/2)x5

B = {1, 2, 4}, N = {3, 5, 6},

A = ( a13 a15 a16 )   ( −1/6 −1/6  1/3 )
    ( a23 a25 a26 ) = (  8/3  2/3 −1/3 )
    ( a43 a45 a46 )   (  1/2 −1/2    0 )

b = (b1, b2, b4 ) = (8, 4, 18), c = (c3, c5, c6 ) = (−1/9, −1/9, −2/9), v = 29.

Note that the indices depend on the sets N and B, and also that the entries in A are the negations of how they
appear in the slack form.
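One can sanity-check the translation: setting the nonbasic variables x3 = x5 = x6 to zero in the slack form yields the basic solution x1 = 8, x2 = 4, x4 = 18 with objective value v = 29. A minimal sketch (the dictionary encoding below is just for illustration):

```python
from fractions import Fraction as F

# Slack form of Example 20.2.3: x_i = b_i - sum_{j in N} a_ij x_j.
B, N = [1, 2, 4], [3, 5, 6]
b = {1: F(8), 2: F(4), 4: F(18)}
a = {(1, 3): F(-1, 6), (1, 5): F(-1, 6), (1, 6): F(1, 3),
     (2, 3): F(8, 3),  (2, 5): F(2, 3),  (2, 6): F(-1, 3),
     (4, 3): F(1, 2),  (4, 5): F(-1, 2), (4, 6): F(0)}
c = {3: F(-1, 9), 5: F(-1, 9), 6: F(-2, 9)}
v = F(29)

x = {j: F(0) for j in N}                 # nonbasic variables set to zero
for i in B:
    x[i] = b[i] - sum(a[i, j] * x[j] for j in N)
z = v + sum(c[j] * x[j] for j in N)
assert (x[1], x[2], x[4], z) == (8, 4, 18, 29)
```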

20.2.4. The Simplex algorithm by example


Before describing the Simplex algorithm in detail, it would be beneficial to derive it on an example. So, consider
the following LP.

max 5x1 + 4x2 + 3x3


s.t. 2x1 + 3x2 + x3 ≤ 5
4x1 + x2 + 2x3 ≤ 11
3x1 + 4x2 + 2x3 ≤ 8
x1, x2, x3 ≥ 0

Next, we introduce slack variables, for example, rewriting 2x1 + 3x2 + x3 ≤ 5 as the constraints: w1 ≥ 0 and
w1 = 5 − 2x1 − 3x2 − x3 . The resulting LP in slack form is
max z= 5x1 + 4x2 + 3x3
s.t. w1 = 5 − 2x1 − 3x2 − x3
w2 = 11 − 4x1 − x2 − 2x3
w3 = 8 − 3x1 − 4x2 − 2x3
x1, x2, x3, w1, w2, w3 ≥ 0
Here w1, w2, w3 are the slack variables. Note also that they are currently also the basic variables. Consider the
trivial solution to this slack representation, where all the nonbasic variables are assigned zero; namely, x1 = x2 = x3 = 0.
We then have that w1 = 5, w2 = 11 and w3 = 8. Fortunately for us, this is a feasible solution, and the associated
objective value is z = 0.
We are interested in further improving the value of the objective function (i.e., z), while still having a feasible
solution. Inspecting the above LP carefully, we realize that all the basic variables w1 = 5, w2 = 11 and w3 = 8
have values which are strictly larger than zero. Clearly, if we change the value of one nonbasic variable a bit,
all the basic variables would remain positive (we are thinking about the above system as being a function of the
nonbasic variables x1, x2 and x3 ). So, consider the objective function z = 5x1 + 4x2 + 3x3 . Clearly, if we increase
the value of x1 from its current zero value, then the value of the objective function would go up, since the
coefficient of x1 in z is a positive number (5 in our example).
Deciding how much to increase the value of x1 is non-trivial. Indeed, as we increase the value of x1 , the
solution might stop being feasible (although the objective function value goes up, which is a good thing). So,
let us increase x1 as much as possible without violating any constraint. In particular, for x2 = x3 = 0 we have
that
w1 = 5 − 2x1 − 3x2 − x3 = 5 − 2x1
w2 = 11 − 4x1 − x2 − 2x3 = 11 − 4x1
w3 = 8 − 3x1 − 4x2 − 2x3 = 8 − 3x1 .
We want to increase x1 as much as possible, as long as w1, w2, w3 are non-negative. Formally, the constraints
are that
w1 = 5 − 2x1 ≥ 0,
w2 = 11 − 4x1 ≥ 0,
and w3 = 8 − 3x1 ≥ 0.
This implies that whatever value we pick for x1 , it must comply with the inequalities x1 ≤ 2.5, x1 ≤ 11/4 = 2.75
and x1 ≤ 8/3 ≈ 2.67. We select as the value of x1 the largest value that still complies with all these conditions.
Namely, x1 = 2.5. Plugging this into the system, we now have the solution
x1 = 2.5, x2 = 0, x3 = 0, w1 = 0, w2 = 1, w3 = 0.5 ⇒ z = 5x1 + 4x2 + 3x3 = 12.5.
As such, all the variables are non-negative and this solution is feasible. Furthermore, this is a better solution
than the previous one, since the old solution had (the objective function) value z = 0.
What really happened? One zero nonbasic variable (i.e., x1 ) became non-zero, and one basic variable became
zero (i.e., w1 ). It is natural now to want to exchange between the nonbasic variable x1 (since it is no longer
zero) and the basic variable w1 . This way, we will preserve the invariant, that the current solution we maintain
is the one where all the nonbasic variables are assigned zero.
So, consider the equality in the LP that involves w1 , that is w1 = 5 − 2x1 − 3x2 − x3 . We can rewrite this
equation, so that x1 is on the left side:
x1 = 2.5 − 0.5w1 − 1.5x2 − 0.5 x3 . (20.1)

The problem is that x1 still appears in the right side of the equations for w2 and w3 in the LP. We observe,
however, that any appearance of x1 can be replaced by substituting it by the expression on the right side of
Eq. (20.1). Collecting similar terms, we get the following equivalent LP:

max z = 12.5 − 2.5w1 − 3.5x2 + 0.5x3


x1 = 2.5 − 0.5w1 − 1.5x2 − 0.5x3
w2 = 1 + 2w1 + 5x2
w3 = 0.5 + 1.5w1 + 0.5x2 − 0.5x3 .

Note, that the nonbasic variables are now {w1, x2, x3 } and the basic variables are {x1, w2, w3 }. In particular, the
trivial solution, of assigning zero to all the nonbasic variables is still feasible; namely we set w1 = x2 = x3 = 0.
Furthermore, the value of this solution is 12.5.
This rewriting step we just did is called pivoting. And the variable we pivoted on is x1 , as x1 was transferred
from being a nonbasic variable into being a basic variable.
We would like to continue pivoting till we reach an optimal solution. We observe that we cannot pivot on
w1 , since if we increase the value of w1 then the objective function value goes down, as the coefficient of w1
is −2.5. Similarly, we cannot pivot on x2 since its coefficient in the objective function is −3.5. Thus, we can
only pivot on x3 , since its coefficient in the objective function is 0.5, which is a positive number.
Checking carefully, it follows that the maximum we can increase x3 is to 1, since then w3 becomes zero.
Thus, rewriting the equality for w3 in the LP; that is,

w3 = 0.5 + 1.5w1 + 0.5x2 − 0.5x3,

for x3 , we have
x3 = 1 + 3w1 + x2 − 2w3,
Substituting this into the LP, we get the following LP.

max z= 13 − w1 − 3x2 − w3
s.t. x1 = 2 − 2w1 − 2x2 + w3
w2 = 1 + 2w1 + 5x2
x3 = 1 + 3w1 + x2 − 2w3

Can we further improve the current (trivial) solution that assigns zero to all the nonbasic variables? (Here
the nonbasic variables are {w1, x2, w3 }.)
The resounding answer is no. We have reached the optimal solution. Indeed, all the coefficients in the
objective function are negative (or zero). As such, the trivial solution (all nonbasic variables set to zero) is
maximal, as the variables must all be non-negative, and increasing their value decreases the value of the objective
function. So we better stop.

Intuition. The crucial observation underlying our reasoning is that at each stage we replaced the LP
by a completely equivalent LP. In particular, any feasible solution to the original LP would be feasible for the
final LP (and vice versa). Furthermore, they would have exactly the same objective function value. However,
in the final LP, we get an objective function that cannot be improved at any feasible point, and we stopped.
Thus, we found the optimal solution to the linear program.
This gives a somewhat informal description of the Simplex algorithm. At each step we pivot on a nonbasic
variable that improves our objective function, till we reach the optimal solution. There is a problem with our
description, as we assumed that the starting (trivial) solution of assigning zero to the nonbasic variables is
feasible. This, of course, might be false. Before providing a formal (and somewhat tedious) description of the
above algorithm, we show how to resolve this problem.
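The pivoting procedure just described can be sketched as a small program. The following dictionary-based implementation (names made up; exact arithmetic via Fraction) runs the worked example above and reaches the same optimum, z = 13 at x1 = 2, x2 = 0, x3 = 1. It assumes b ≥ 0, so that the all-zero solution is a feasible start, and it does not handle degeneracies or cycling:

```python
from fractions import Fraction as F

def simplex(c, A, b):
    """Dictionary-based simplex for: max c.x s.t. A x <= b, x >= 0,
    assuming b >= 0 so the all-zero solution is a feasible start."""
    m, n = len(A), len(c)
    N = list(range(n))               # nonbasic variable indices
    B = list(range(n, n + m))        # basic (slack) variable indices
    # Row i encodes x_{B[i]} = rows[i][0] + sum_k rows[i][k+1] * x_{N[k]}.
    rows = [[F(b[i])] + [-F(A[i][j]) for j in range(n)] for i in range(m)]
    z = [F(0)] + [F(cj) for cj in c]
    while True:
        e = next((k for k in range(n) if z[k + 1] > 0), None)
        if e is None:                # no improving nonbasic variable: optimal
            break
        # Ratio test: how far x_{N[e]} may grow before a basic var hits 0.
        cand = [(-rows[i][0] / rows[i][e + 1], i)
                for i in range(m) if rows[i][e + 1] < 0]
        if not cand:
            raise ValueError("LP is unbounded")
        _, l = min(cand)
        # Pivot: solve row l for x_{N[e]} and substitute it everywhere.
        piv = rows[l]
        coef = piv[e + 1]
        new = [-piv[k] / coef for k in range(n + 1)]
        new[e + 1] = F(1) / coef     # coefficient of the leaving variable
        for row in [r for i, r in enumerate(rows) if i != l] + [z]:
            a = row[e + 1]
            for k in range(n + 1):
                row[k] = a * new[k] if k == e + 1 else row[k] + a * new[k]
        rows[l] = new
        N[e], B[l] = B[l], N[e]
    x = [F(0)] * n
    for i, var in enumerate(B):
        if var < n:
            x[var] = rows[i][0]
    return z[0], x

# The LP worked through above: optimum z = 13 at (x1, x2, x3) = (2, 0, 1).
assert simplex([5, 4, 3], [[2, 3, 1], [4, 1, 2], [3, 4, 2]], [5, 11, 8]) \
    == (13, [2, 0, 1])
```

Tracing this code on the example reproduces the two pivots shown in the text (first on x1 against w1, then on x3 against w3).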

20.2.4.1. Starting somewhere

We had transformed a linear programming problem into slack form:

max z = v + Σ_{j∈N} cj xj ,
s.t. xi = bi − Σ_{j∈N} aij xj for i ∈ B,
xi ≥ 0, ∀i = 1, . . . , n + m.

Intuitively, what the Simplex algorithm is going to do is start from a feasible solution and walk around in the
feasible region till it reaches the best possible point as far as the objective function is concerned. But maybe
the linear program L is not feasible at all (i.e., no solution exists). So let L be a linear program in the slack
form depicted above. Clearly, if we set xi = 0 for all i ∈ N, then this determines the values of the basic variables.
If they are all positive, we are done, as we have found a feasible solution. The problem is that they might be
negative.
We generate a new LP problem L′ from L. This LP L′ = Feasible(L) is the following:

min x0
s.t. xi = x0 + bi − Σ_{j∈N} aij xj for i ∈ B,
xi ≥ 0, ∀i = 0, . . . , n + m.

Clearly, if we pick xj = 0 for all j ∈ N (all the nonbasic variables), and a value large enough for x0 , then all
the basic variables would be non-negative, and as such, we have found a feasible solution for L′. Let
LPStartSolution(L′) denote this easily computable feasible solution.
We can now use the Simplex algorithm we described to find the optimal solution to L′ (because we have a
feasible solution to start from!).

Lemma 20.2.4. The LP L is feasible if and only if the optimal objective value of LP L 0 is zero.

Proof: A feasible solution to L is immediately an optimal solution to L′ with x0 = 0, and vice versa. Namely,
given a solution to L′ with x0 = 0, we can transform it to a feasible solution to L by removing x0 .

One technicality that is ignored above is that the starting solution we have for L′, generated by LPStartSolution(L′),
is not legal as far as the slack form is concerned, because the nonbasic variable x0 is assigned a non-zero value.
However, this can be easily resolved by immediately pivoting on x0 when we run the Simplex algorithm. Namely,
we first try to decrease x0 as much as possible.
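A sketch of the idea behind Feasible(L): given the constants bi of the slack form, pick x0 just large enough that setting all the original nonbasic variables to zero makes every basic variable of L′ non-negative (the encoding below is made up for illustration):

```python
def feasible_start(b):
    """Given the constants b_i of a slack form (x_i = b_i - ...), choose
    x0 so that setting all original nonbasic variables to zero is feasible
    for the auxiliary LP x_i = x0 + b_i - ... ."""
    x0 = max(0, max(-bi for bi in b))
    basics = [x0 + bi for bi in b]
    assert all(v >= 0 for v in basics)   # the start is feasible for L'
    return x0, basics

# b = (5, -3, 2): the trivial solution is infeasible for L (x = -3 < 0),
# but with x0 = 3 all basic variables of L' are non-negative.
x0, basics = feasible_start([5, -3, 2])
assert x0 == 3 and basics == [8, 0, 5]
```

If the Simplex algorithm can then drive x0 down to zero, dropping x0 recovers a feasible solution to L, exactly as Lemma 20.2.4 states.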

Chapter 21

Linear Programming II

21.1. The Simplex Algorithm in Detail

max z = v + Σ_{j∈N} cj xj ,
s.t. xi = bi − Σ_{j∈N} aij xj for i ∈ B,
xi ≥ 0, ∀i = 1, . . . , n + m.

B – set of indices of basic variables
N – set of indices of nonbasic variables
n = |N| – number of original variables
m = |B| – number of basic variables (i.e., number of inequalities)
A = (aij) – the matrix of coefficients
b, c – two vectors of constants
N ∪ B = {1, . . . , n + m}
v – objective function constant.

Figure 21.2: A linear program in slack form is specified by a tuple (N, B, A, b, c, v).

The Simplex algorithm is presented in Figure 21.1. We assume that we are given SimplexInner, a black
box that solves a LP if the trivial solution of assigning zero to all the nonbasic variables is feasible. We remind
the reader that L′ = Feasible(L) returns a new LP for which we have an easy feasible solution. This is done by
introducing a new variable x0 into the LP, where the original LP L is feasible if and only if the new LP L′ has a
feasible solution with x0 = 0. As such, we set the target function in L′ to be minimizing x0 .

Simplex(L̂: an LP)
    Transform L̂ into slack form.
    Let L be the resulting slack form.
    L′ ← Feasible(L)
    x ← LPStartSolution(L′)
    x′ ← SimplexInner(L′, x)    (*)
    z ← objective function value of x′
    if z > 0 then
        return “No solution”
    x″ ← SimplexInner(L, x′)
    return x″

Figure 21.1: The Simplex algorithm.

We now apply SimplexInner to L′ and the easy solution computed for L′ by LPStartSolution(L′). If
x0 > 0 in the optimal solution for L′, then there is no feasible solution for L, and we exit. Otherwise, we have
found a feasible solution to L, and we use it as the starting point for SimplexInner when it is applied to L.
Thus, in the following, we have to describe SimplexInner – a procedure to solve an LP in slack form, when
we start from a feasible solution defined by the nonbasic variables being assigned the value zero.
One technicality that is ignored above is that the starting solution we have for L′, generated by LPStart-
Solution(L′), is not legal as far as the slack form is concerned, because the nonbasic variable x0 is assigned
a non-zero value. However, this can be easily resolved by immediately pivoting on x0 when we execute (*) in
Figure 21.1. Namely, we first try to decrease x0 as much as possible.

21.2. The SimplexInner Algorithm


We next describe the SimplexInner algorithm.
We remind the reader that the LP is given to us in slack form; see Figure 21.2. Furthermore, we assume
that the trivial solution x = τ, which assigns zero to all the nonbasic variables, is feasible. In particular, we
immediately get the objective value for this solution from the notation, which is v.
Assume that we have a nonbasic variable xe that appears in the objective function, and furthermore that its
coefficient ce is positive in the objective function, which is z = v + Σ_{j∈N} cj xj . Formally, we pick e to be one of
the indices of

{ j | cj > 0, j ∈ N } .

The variable xe is the entering variable (since it is going to join the set of basic variables).
Clearly, if we increase the value of xe (from its current value of 0 in τ), then one of the basic variables is
going to vanish (i.e., become zero). Let xl be this basic variable. We increase the value of xe (the entering
variable) till xl (the leaving variable) becomes zero.
Setting all the other nonbasic variables to zero, and letting xe grow, implies that xi = bi − aie xe , for all i ∈ B.
All these variables must be non-negative, and thus we require that ∀i ∈ B it holds that xi = bi − aie xe ≥ 0.
Namely, xe ≤ bi /aie , or alternatively, 1/xe ≥ aie /bi . Thus 1/xe ≥ max_{i∈B} aie /bi , and the largest value of xe
which is still feasible is

U = ( max_{i∈B} aie /bi )^{−1} .

We pick l (the index of the leaving variable) from the set of all basic variables that vanish to zero when xe = U.
Namely, l is from the set

{ j ∈ B | bj /aje = U } .

Now, we know xe and xl . We rewrite the equation for xl in the LP so that it has xe on the left side. Formally,
we do

xl = bl − Σ_{j∈N} alj xj   ⟹   xe = bl /ale − Σ_{j∈(N∪{l})\{e}} (alj /ale ) xj ,   where we set all = 1.

We need to remove all the appearances of xe on the right side of the LP. This can be done by substituting
xe into the other equalities, using the above equality. Alternatively, we can do Gaussian elimination beforehand
to remove any appearance of xe on the right side of the equalities in the LP (and also from the objective function),
replacing it by appearances of xl , which we then transfer to the right side.
At the end of this process, we have a new equivalent LP where the basic variables are B′ = (B \ {l}) ∪ {e}
and the nonbasic variables are N′ = (N \ {e}) ∪ {l}.
At the end of this pivoting stage the LP objective function value has increased, and as such, we made progress.
Note that the linear system is completely defined by which variables are basic and which are nonbasic.
Furthermore, pivoting never returns to a combination (of basic/nonbasic variables) that was already visited.
Indeed, we improve the value of the objective function in each pivoting stage. Thus, we can do at most
C(n + m, n) ≤ ( (n + m) · e / n )^n

pivoting steps. And this is close to tight in the worst case (there are examples where 2^n pivoting steps are
needed).
Each pivoting step takes polynomial time in n and m. Thus, the overall running time of Simplex is exponential
in the worst case. However, in practice, Simplex is extremely fast.

21.2.1. Degeneracies
If you inspect the Simplex algorithm carefully, you will notice that it might get stuck if one of the bi 's is zero.
This corresponds to a case where more than m hyperplanes pass through the same point. This might cause the effect
that you might not be able to make any progress at all in pivoting.
There are several solutions; the simplest one is to add tiny random noise to each coefficient. You can even
do this symbolically. Intuitively, the degeneracy, being a local phenomenon on the polytope, disappears with high
probability.
The larger danger is that you would get into cycling; namely, a sequence of pivoting operations that do not
improve the objective function, where the bases you get are cyclic (i.e., the algorithm is stuck in an infinite loop).
There is a simple scheme based on using the symbolic perturbation, that avoids cycling, by carefully choosing
what is the leaving variable. This is described in detail in Section 21.6.
There is an alternative approach, called Bland’s rule, which always chooses the candidate variable with the lowest
index for entering and leaving. We will not prove the correctness of this approach here.

21.2.2. Correctness of linear programming
Definition 21.2.1. A solution to an LP is a basic solution if it is the result of setting all the nonbasic variables
to zero.

Note that the Simplex algorithm deals only with basic solutions. In particular we get the following.

Theorem 21.2.2 (Fundamental theorem of Linear Programming.). For an arbitrary linear program,
the following statements are true:
(A) If there is no optimal solution, the problem is either infeasible or unbounded.
(B) If a feasible solution exists, then a basic feasible solution exists.
(C) If an optimal solution exists, then a basic optimal solution exists.

Proof: The proof is constructive, obtained by running the Simplex algorithm.

21.2.3. On the ellipsoid method and interior point methods


The Simplex algorithm has exponential running time in the worst case.
The ellipsoid method is weakly polynomial (namely, it is polynomial in the number of bits of the input).
Khachiyan came up with it in 1979. It turned out to be completely useless in practice.
In 1984, Karmarkar came up with a different method, called the interior-point method, which is also weakly
polynomial. However, it turned out to be quite useful in practice, resulting in an arms race between the interior-
point method and the simplex method.
The question of whether there is a strongly polynomial time algorithm for linear programming is one of the
major open questions in computer science.

21.3. Duality and Linear Programming


Every linear program L has a dual linear program L′. Solving the dual problem is essentially equivalent to
solving the primal linear program (i.e., the original LP).

21.3.1. Duality by Example

Consider the linear program L depicted in Figure 21.3:

max z = 4x1 + x2 + 3x3
s.t. x1 + 4x2 ≤ 1
3x1 − x2 + x3 ≤ 3
x1, x2, x3 ≥ 0

Figure 21.3: The linear program L.

Note that any feasible solution gives us a lower bound on the maximal value of the target function, denoted by η.
In particular, the solution x1 = 1, x2 = x3 = 0 is feasible, and implies z = 4 and thus η ≥ 4.
Similarly, x1 = x2 = 0, x3 = 3 is feasible and implies that η ≥ z = 9.
We might be wondering how close this solution is to the optimal solution. In particular, if this solution is
very close to the optimal solution, we might be willing to stop and be satisfied with it.
Let us add the first inequality (multiplied by 2) to the second inequality (multiplied by 3). Namely, we add
the two inequalities:

2( x1 + 4x2 ) ≤ 2(1)
+3(3x1 − x2 + x3 ) ≤ 3(3).

The resulting inequality is

11x1 + 5x2 + 3x3 ≤ 11. (21.1)

Note that this inequality must hold for any feasible solution of L. Now, the objective function is z = 4x1 + x2 + 3x3 ,
and x1, x2 and x3 are all non-negative, and the inequality of Eq. (21.1) has larger coefficients than all the
coefficients of the target function, for the corresponding variables. It thus follows that for any feasible solution,
we have

z = 4x1 + x2 + 3x3 ≤ 11x1 + 5x2 + 3x3 ≤ 11,

since all the variables are non-negative. As such, the optimal value of the LP L is somewhere between 9 and 11.
We can extend this argument. Let us multiply the first inequality by y1 and the second inequality by y2 and
add them up. We get:

      y1 (x1 + 4x2)         ≤ y1 (1)
    + y2 (3x1 − x2 + x3)    ≤ y2 (3)
    ─────────────────────────────────
    (y1 + 3y2)x1 + (4y1 − y2)x2 + y2 x3 ≤ y1 + 3y2.                 (21.2)

Compare this to the target function z = 4x1 + x2 + 3x3. If the coefficient of each variable in this expression
is at least as large as its coefficient in the target function, namely

    4 ≤ y1 + 3y2
    1 ≤ 4y1 − y2
    3 ≤ y2,

then z = 4x1 + x2 + 3x3 ≤ (y1 + 3y2)x1 + (4y1 − y2)x2 + y2 x3 ≤ y1 + 3y2, where the last step follows by
Eq. (21.2).
Thus, if we want the best upper bound on η (the maximal value of z), then we want to solve the LP L̂ depicted
in Figure 21.4:

    min   y1 + 3y2
    s.t.  y1 + 3y2 ≥ 4
          4y1 − y2 ≥ 1
          y2 ≥ 3
          y1, y2 ≥ 0.

Figure 21.4: The dual LP L̂. The primal LP is depicted in Figure 21.3.

This is the dual program to L, and its optimal solution is an upper bound on the optimal solution for L.
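The bound chasing above can be sanity-checked numerically. The following Python sketch (added for illustration; the helper functions are ours, not part of the notes) verifies the lower bounds η ≥ 4 and η ≥ 9, the upper bound 11 obtained from the multipliers (2, 3) of Eq. (21.1), and exhibits a feasible primal/dual pair whose values both equal 10, certifying that the optimum of L is exactly 10.

```python
# Sanity checks for the example LP:
# max z = 4*x1 + x2 + 3*x3  s.t.  x1 + 4*x2 <= 1,  3*x1 - x2 + x3 <= 3,  x >= 0.

def z(x1, x2, x3):
    return 4*x1 + x2 + 3*x3

def feasible(x1, x2, x3):
    return (x1 >= 0 and x2 >= 0 and x3 >= 0
            and x1 + 4*x2 <= 1
            and 3*x1 - x2 + x3 <= 3)

def upper_bound(y1, y2):
    """y1*1 + y2*3 bounds z from above whenever the combined coefficients
    dominate the objective's coefficients (i.e., (y1, y2) is dual feasible)."""
    assert y1 >= 0 and y2 >= 0
    assert y1 + 3*y2 >= 4 and 4*y1 - y2 >= 1 and y2 >= 3
    return y1 * 1 + y2 * 3

# Lower bounds from feasible primal solutions:
assert feasible(1, 0, 0) and z(1, 0, 0) == 4        # eta >= 4
assert feasible(0, 0, 3) and z(0, 0, 3) == 9        # eta >= 9

# Upper bound 11 from the multipliers (y1, y2) = (2, 3), as in Eq. (21.1):
assert upper_bound(2, 3) == 11

# A matching pair: x = (0, 1/4, 13/4) and y = (1, 3) both give value 10,
# so the optimum of L is exactly 10.
assert feasible(0, 0.25, 3.25) and abs(z(0, 0.25, 3.25) - 10) < 1e-12
assert upper_bound(1, 3) == 10
```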

21.3.2. The Dual Problem


Given a linear programming problem (i.e., a primal problem), seen in Figure 21.5 (a), its associated dual linear
program is shown in Figure 21.5 (b). The standard form of the dual LP is depicted in Figure 21.5 (c). Interestingly,
you can compute the dual LP of this dual LP. What you get back is the original LP. This is demonstrated
in Figure 21.6.
We just proved the following result.
Lemma 21.3.1. Let L be an LP, and let L′ be its dual. Let L′′ be the dual to L′. Then L and L′′ are the same
LP.
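The transformation of Figure 21.5 is purely mechanical, which a short sketch makes concrete (illustrative Python; representing an LP as a triple (c, A, b) is our own convention, not the notes'). Applying the recipe twice returns the original data, which is exactly the content of Lemma 21.3.1.

```python
# The duality transform of Figure 21.5: a max LP (max c.x : Ax <= b, x >= 0)
# maps to (min b.y : A^T y >= c, y >= 0), which in standard (max, <=) form is
# (max (-b).y : (-A^T) y <= -c, y >= 0). Dualizing twice is the identity.

def transpose(A):
    return [list(col) for col in zip(*A)]

def dual(c, A, b):
    """Dual of max{c.x : Ax <= b, x >= 0}, written back in standard max/<= form."""
    c2 = [-bi for bi in b]                                  # maximize (-b).y
    A2 = [[-aij for aij in row] for row in transpose(A)]    # (-A^T) y <= -c
    b2 = [-cj for cj in c]
    return c2, A2, b2

# The example LP L from Figure 21.3:
c = [4, 1, 3]
A = [[1, 4, 0], [3, -1, 1]]
b = [1, 3]

assert dual(*dual(c, A, b)) == (c, A, b)   # the dual of the dual is the primal
```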

21.3.3. The Weak Duality Theorem


Theorem 21.3.2. If (x1, x2, . . . , xn) is feasible for the primal LP and (y1, y2, . . . , ym) is feasible for the dual LP,
then

    Σ_j c_j x_j ≤ Σ_i b_i y_i.

Namely, all the feasible solutions of the dual bound all the feasible solutions of the primal.

Proof: By substitution from the dual form, and since the two solutions are feasible, we know that

    Σ_j c_j x_j ≤ Σ_j ( Σ_{i=1}^{m} y_i a_ij ) x_j = Σ_i ( Σ_j a_ij x_j ) y_i ≤ Σ_i b_i y_i.
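The inequality can also be tested by brute force on the example LP of Figure 21.3 (an illustrative sketch; the grids below are arbitrary, not exhaustive): every feasible primal point and every feasible dual point on the grid satisfy c·x ≤ b·y.

```python
# Brute-force check of weak duality on the example LP:
# primal: max c.x s.t. A x <= b, x >= 0;  dual: min b.y s.t. A^T y >= c, y >= 0.

c, b = [4, 1, 3], [1, 3]
A = [[1, 4, 0], [3, -1, 1]]

def primal_feasible(x):
    return all(xi >= 0 for xi in x) and all(
        sum(aij * xj for aij, xj in zip(row, x)) <= bi
        for row, bi in zip(A, b))

def dual_feasible(y):
    cols = list(zip(*A))                      # columns of A
    return all(yi >= 0 for yi in y) and all(
        sum(aij * yi for aij, yi in zip(col, y)) >= cj
        for col, cj in zip(cols, c))

grid = [i / 4 for i in range(0, 17)]          # 0, 0.25, ..., 4
xs = [(x1, x2, x3) for x1 in grid for x2 in grid for x3 in grid
      if primal_feasible((x1, x2, x3))]
ys = [(y1, y2) for y1 in grid for y2 in grid if dual_feasible((y1, y2))]

assert xs and ys                              # both grids are non-empty
for x in xs:
    for y in ys:
        zx = sum(ci * xi for ci, xi in zip(c, x))
        wy = sum(bi * yi for bi, yi in zip(b, y))
        assert zx <= wy + 1e-9                # weak duality holds for each pair
```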

(a) primal program:

    max   Σ_{j=1}^{n} c_j x_j
    s.t.  Σ_{j=1}^{n} a_ij x_j ≤ b_i,        for i = 1, . . . , m,
          x_j ≥ 0,                            for j = 1, . . . , n.

(b) dual program:

    min   Σ_{i=1}^{m} b_i y_i
    s.t.  Σ_{i=1}^{m} a_ij y_i ≥ c_j,        for j = 1, . . . , n,
          y_i ≥ 0,                            for i = 1, . . . , m.

(c) dual program in standard form:

    max   Σ_{i=1}^{m} (−b_i) y_i
    s.t.  Σ_{i=1}^{m} (−a_ij) y_i ≤ −c_j,    for j = 1, . . . , n,
          y_i ≥ 0,                            for i = 1, . . . , m.

Figure 21.5: Dual linear programs.

(a) dual program:

    max   Σ_{i=1}^{m} (−b_i) y_i
    s.t.  Σ_{i=1}^{m} (−a_ij) y_i ≤ −c_j,    for j = 1, . . . , n,
          y_i ≥ 0,                            for i = 1, . . . , m.

(b) the dual program to the dual program:

    min   Σ_{j=1}^{n} (−c_j) x_j
    s.t.  Σ_{j=1}^{n} (−a_ij) x_j ≥ −b_i,    for i = 1, . . . , m,
          x_j ≥ 0,                            for j = 1, . . . , n.

(c) ... which is the original LP:

    max   Σ_{j=1}^{n} c_j x_j
    s.t.  Σ_{j=1}^{n} a_ij x_j ≤ b_i,        for i = 1, . . . , m,
          x_j ≥ 0,                            for j = 1, . . . , n.

Figure 21.6: The dual to the dual linear program. Computing the dual of (a) can be done mechanically by
following Figure 21.5 (a) and (b). Note that (c) is just a rewriting of (b).

Interestingly, if we apply the weak duality theorem on the dual program (namely, Figure 21.6 (a) and (b)),
we get the inequality

    Σ_{i=1}^{m} (−b_i) y_i ≤ Σ_{j=1}^{n} (−c_j) x_j,

which is the original inequality in the weak duality theorem. Thus, the weak duality theorem does not imply
the strong duality theorem, which will be discussed next.

21.4. The strong duality theorem


The strong duality theorem states the following.

Theorem 21.4.1. If the primal LP problem has an optimal solution x∗ = (x∗1, . . . , x∗n), then the dual also has an
optimal solution, y∗ = (y∗1, . . . , y∗m), such that

    Σ_j c_j x∗_j = Σ_i b_i y∗_i.

Its proof is somewhat tedious and not very insightful. The basic idea is to run the simplex algorithm
simultaneously on both the primal and the dual LP, making steps in sync. When the two stop, if both are
feasible, their values must be equal. We omit the tedious proof.

21.5. Some duality examples
21.5.1. Shortest path
You are given a graph G = (V, E), with source s and target t. We have a weight ω(u, v) on each edge (u, v) ∈ E, and
we are interested in the shortest path in this graph from s to t. To simplify the exposition, assume that there
are no incoming edges into s and no edges leaving t. To this end, let dx be a variable that stands for the distance
between s and x, for any x ∈ V. Clearly, we must have du + ω(u, v) ≥ dv for any edge (u, v) ∈ E. We also know
that ds = 0. A trivial solution to these constraints is to set all the variables to zero. So, we are trying to find
the assignment that maximizes dt, such that all the constraints are satisfied. As such, the LP for computing the
shortest path from s to t is the following LP.

max dt
s.t. ds ≤ 0
du + ω(u, v) ≥ dv ∀(u, v) ∈ E,
dx ≥ 0 ∀x ∈ V.

Equivalently, we get

max dt
s.t. ds ≤ 0
dv − du ≤ ω(u, v) ∀(u, v) ∈ E,
dx ≥ 0 ∀x ∈ V.

Let us compute the dual. To this end, let yuv be the dual variable for the edge (u, v), and let ys be the dual
variable for the ds ≤ 0 inequality. We get the following dual LP.
    min   Σ_{(u,v)∈E} yuv ω(u, v)
    s.t.  ys − Σ_{(s,u)∈E} ysu ≥ 0                                       (∗)
          Σ_{(u,x)∈E} yux − Σ_{(x,v)∈E} yxv ≥ 0    ∀x ∈ V \ {s, t}      (∗∗)
          Σ_{(u,t)∈E} yut ≥ 1                                            (∗ ∗ ∗)
          yuv ≥ 0                                   ∀(u, v) ∈ E,
          ys ≥ 0.

Look carefully at this LP. The trick is to think about yuv as a flow on the edge (u, v). (Also, we assume
here that the weights are positive.) Then, this LP is the min cost flow of sending one unit of flow from the
source s to t. Indeed, if the weights are positive, then (∗∗) can be assumed to hold with equality in the
optimal solution, and this is conservation of flow. Inequality (∗ ∗ ∗) implies that one unit of flow arrives at the
sink t. Finally, (∗) implies that at most ys units of flow leave the source; since one unit must arrive at the sink,
the rest of the LP implies that ys ≥ 1. Of course, this min-cost flow version is without capacities on the edges.
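To see the primal LP in action, the following sketch (on a made-up toy graph, added for illustration) checks that the true shortest-path distances from s form a feasible assignment for the constraints dv − du ≤ ω(u, v), ds = 0, and that dt equals the shortest-path length:

```python
# Shortest-path distances as a feasible LP solution, on a small toy digraph.
import heapq

edges = {('s', 'a'): 2, ('s', 'b'): 5, ('a', 'b'): 1, ('a', 't'): 6, ('b', 't'): 2}

def dijkstra(src):
    """Standard Dijkstra; scans the whole edge dict per pop (fine for a toy)."""
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float('inf')):
            continue                      # stale queue entry
        for (x, y), w in edges.items():
            if x == u and d + w < dist.get(y, float('inf')):
                dist[y] = d + w
                heapq.heappush(pq, (dist[y], y))
    return dist

d = dijkstra('s')

# d is feasible for the LP: d_s = 0 and d_v - d_u <= w(u, v) on every edge.
assert d['s'] == 0
assert all(d[v] - d[u] <= w for (u, v), w in edges.items())

# The shortest s-t path is s -> a -> b -> t, of length 2 + 1 + 2 = 5.
assert d['t'] == 5
```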

21.5.2. Set Cover and Packing
Consider an instance of Set Cover with (S, F), where S = {u1, . . . , un } and F = {F1, . . . , Fm }, where Fi ⊆ S. The
natural LP to solve this problem is
    min   Σ_{F_j ∈ F} x_j
    s.t.  Σ_{F_j ∈ F : u_i ∈ F_j} x_j ≥ 1    ∀u_i ∈ S,
          x_j ≥ 0                             ∀F_j ∈ F.

The dual LP is

    max   Σ_{u_i ∈ S} y_i
    s.t.  Σ_{u_i ∈ F_j} y_i ≤ 1    ∀F_j ∈ F,
          y_i ≥ 0                   ∀u_i ∈ S.

This is a packing LP: we are trying to pick as many vertices as possible, such that no set contains more than one
picked vertex. If the sets in F are pairs (i.e., the set system is a graph), then the problem is known as edge
cover, and the dual problem is the familiar independent set problem. Of course, these are all the fractional
versions – getting an integral solution for these problems is completely non-trivial, and in all these cases it is
impossible in polynomial time (unless P = NP), since the problems are NP-Complete.
As an exercise, write the LP for Set Cover for the case where every set has a price associated with it, and
you are trying to minimize the total cost of the cover.
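A tiny numerical illustration of this covering/packing pair (the set system below is made up for the example): by weak duality, any fractional cover x and any fractional packing y satisfy Σ yi ≤ Σ xj.

```python
# Weak duality between fractional set cover (primal) and packing (dual),
# on a tiny hand-built set system.

S = ['u1', 'u2', 'u3', 'u4']
F = {'F1': {'u1', 'u2'}, 'F2': {'u2', 'u3'}, 'F3': {'u3', 'u4'}}

def is_cover(x):
    """Every element is fractionally covered, and x >= 0."""
    return (all(v >= 0 for v in x.values())
            and all(sum(x[j] for j in F if u in F[j]) >= 1 for u in S))

def is_packing(y):
    """No set is fractionally exceeded, and y >= 0."""
    return (all(v >= 0 for v in y.values())
            and all(sum(y[u] for u in F[j]) <= 1 for j in F))

x = {'F1': 1.0, 'F2': 0.0, 'F3': 1.0}              # an (integral) cover
y = {'u1': 1.0, 'u2': 0.0, 'u3': 0.5, 'u4': 0.5}   # a fractional packing

assert is_cover(x) and is_packing(y)
assert sum(y.values()) <= sum(x.values())          # weak duality: 2.0 <= 2.0 here
```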

21.5.3. Network flow


(We do the following in excruciating detail – hopefully it makes the presentation clearer.)
Let us assume we are given an instance of network flow G, with source s and sink t. As usual, let us assume
there are no incoming edges into the source, no outgoing edges from the sink, and that the two are not connected
by an edge. The LP for this network flow is the following.
    max   Σ_{(s,v)∈E} xs→v
    s.t.  xu→v ≤ c(u → v)                                       ∀(u, v) ∈ E
          Σ_{(u,v)∈E} xu→v − Σ_{(v,w)∈E} xv→w ≤ 0              ∀v ∈ V \ {s, t}
          −Σ_{(u,v)∈E} xu→v + Σ_{(v,w)∈E} xv→w ≤ 0             ∀v ∈ V \ {s, t}
          0 ≤ xu→v                                              ∀(u, v) ∈ E.

To perform the duality transform, we define a dual variable for each inequality; the variable assigned to each
constraint is written on its right:

    max   Σ_{(s,v)∈E} xs→v
    s.t.  xu→v ≤ c(u → v)                                       ∗ yu→v    ∀(u, v) ∈ E
          Σ_{(u,v)∈E} xu→v − Σ_{(v,w)∈E} xv→w ≤ 0              ∗ yv      ∀v ∈ V \ {s, t}
          −Σ_{(u,v)∈E} xu→v + Σ_{(v,w)∈E} xv→w ≤ 0             ∗ y′v     ∀v ∈ V \ {s, t}
          0 ≤ xu→v                                                        ∀(u, v) ∈ E.

Now, we generate the inequalities on the coefficients of the variables of the target function. We need to carefully
account for the edges, and we observe that there are three kinds of edges: source edges, regular edges, and sink
edges. Doing the duality transformation carefully, we get the following:

    min   Σ_{(u,v)∈E} c(u → v) yu→v
    s.t.  1 ≤ ys→v + yv − y′v                       ∀(s, v) ∈ E
          0 ≤ yu→v + yv − y′v − yu + y′u            ∀(u, v) ∈ E(G \ {s, t})
          0 ≤ yv→t − yv + y′v                       ∀(v, t) ∈ E
          yu→v ≥ 0                                   ∀(u, v) ∈ E
          yv ≥ 0                                     ∀v ∈ V
          y′v ≥ 0                                    ∀v ∈ V

To understand what is going on, let us rewrite the LP, introducing the variable dv = yv − y′v for each v ∈ V.¬
We get the following modified LP:
    min   Σ_{(u,v)∈E} c(u → v) yu→v
    s.t.  1 ≤ ys→v + dv            ∀(s, v) ∈ E
          0 ≤ yu→v + dv − du       ∀(u, v) ∈ E(G \ {s, t})
          0 ≤ yv→t − dv            ∀(v, t) ∈ E
          yu→v ≥ 0                  ∀(u, v) ∈ E

Adding two more variables, ds and dt, and setting their values to ds = 1 and dt = 0, we get the following
LP:

    min   Σ_{(u,v)∈E} c(u → v) yu→v
    s.t.  0 ≤ ys→v + dv − ds       ∀(s, v) ∈ E
          0 ≤ yu→v + dv − du       ∀(u, v) ∈ E(G \ {s, t})
          0 ≤ yv→t + dt − dv       ∀(v, t) ∈ E
          yu→v ≥ 0                  ∀(u, v) ∈ E
          ds = 1, dt = 0
¬ We could have done this directly, treating the two inequalities as a single equality, and multiplying it by a single variable that can
be both positive and negative – however, it is useful to see why this is correct at least once.

This simplifies to the following LP:

    min   Σ_{(u,v)∈E} c(u → v) yu→v
    s.t.  du − dv ≤ yu→v           ∀(u, v) ∈ E
          yu→v ≥ 0                  ∀(u, v) ∈ E
          ds = 1, dt = 0.

The above LP can be interpreted as follows: we are assigning weights to the edges (i.e., yu→v). Given such
an assignment, it is easy to verify that setting du (for all u) to be the shortest path distance under this weighting
complies with all the inequalities; the assignment ds = 1 means that we require the shortest path
distance from the source to the sink to have length exactly one.
We next argue that the optimal solution to this LP is a min-cut. Let us start with one direction:
given a cut (S, T) with s ∈ S and t ∈ T, observe that setting

    du = 1          ∀u ∈ S
    du = 0          ∀u ∈ T
    yu→v = 1        ∀(u, v) ∈ (S, T)
    yu→v = 0        ∀(u, v) ∈ E \ (S, T)

is a valid solution for the LP.
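The following sketch checks this claim on a made-up toy network (added for illustration): the cut assignment is feasible for the constraints du − dv ≤ yu→v with ds = 1, dt = 0, and its LP objective value is exactly the capacity of the cut.

```python
# A cut (S, T) as a feasible solution of the simplified dual LP.

E = [('s', 'a'), ('s', 'b'), ('a', 'b'), ('a', 't'), ('b', 't')]
cap = {('s', 'a'): 3, ('s', 'b'): 2, ('a', 'b'): 1, ('a', 't'): 2, ('b', 't'): 4}

S, T = {'s', 'a'}, {'b', 't'}        # a cut with s in S and t in T

d = {v: (1 if v in S else 0) for v in S | T}
y = {(u, v): (1 if u in S and v in T else 0) for (u, v) in E}

# Feasibility: d_s = 1, d_t = 0, d_u - d_v <= y_{u->v}, y >= 0.
assert d['s'] == 1 and d['t'] == 0
assert all(d[u] - d[v] <= y[(u, v)] for (u, v) in E)
assert all(yv >= 0 for yv in y.values())

# The LP objective for this solution is exactly the capacity of the cut (S, T):
# edges (s,b), (a,b), (a,t) cross the cut, so the objective is 2 + 1 + 2 = 5.
assert sum(cap[e] * y[e] for e in E) == 5
```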


As for the other direction, consider the optimal solution for the LP, and let its target function value be

    α∗ = Σ_{(u,v)∈E} c(u → v) y∗u→v

(we use the (∗) notation to denote the values of the variables in the optimal LP solution). Consider generating
a cut as follows: we pick a value z ∈ [0, 1] uniformly at random, and we set

    S = { u | d∗u ≥ z }    and    T = { u | d∗u < z }.

This is a valid cut, as s ∈ S (as d∗s = 1) and t ∈ T (as d∗t = 0). Furthermore, an edge (u, v) is in the cut only if
d∗u > d∗v (otherwise, it is not possible to cut this edge using this approach).
In particular, the probability of u ∈ S and v ∈ T is exactly d∗u − d∗v! Indeed, it is the probability that z falls
inside the interval [d∗v, d∗u]. As such, (u, v) is in the cut with probability d∗u − d∗v (again, only if d∗u > d∗v), which
is bounded by y∗u→v (by the inequality du − dv ≤ yu→v in the LP).
So, let Xu→v be an indicator variable which is one if the edge is in the generated cut. We just argued that
E[Xu→v] = P[Xu→v = 1] ≤ y∗u→v. We thus have that the expected cost of this random cut is

    E[ Σ_{(u,v)∈E} Xu→v c(u → v) ] = Σ_{(u,v)∈E} c(u → v) E[Xu→v] ≤ Σ_{(u,v)∈E} c(u → v) y∗u→v = α∗.

That is, the expected cost of a random cut here is at most the value of the LP optimal solution. In particular,
there must be a cut that has cost at most α∗; see Remark 21.5.2 below. However, we argued that α∗ is no larger
than the cost of any cut. We conclude that α∗ is the cost of the min cut.
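The threshold-rounding argument can be simulated (an illustrative Monte Carlo sketch on a made-up two-edge path; here the expected cut cost equals the LP value, since both edge inequalities are tight in the chosen fractional solution):

```python
# Threshold rounding: cut at a uniform z in [0, 1] and compare the average cut
# cost to the LP objective sum c(e) * y(e), on the path s -> a -> t.
import random

random.seed(1)
cap = {('s', 'a'): 4.0, ('a', 't'): 6.0}
d = {'s': 1.0, 'a': 0.5, 't': 0.0}               # a fractional feasible solution
y = {('s', 'a'): 0.5, ('a', 't'): 0.5}           # with d_u - d_v = y on each edge

lp_value = sum(cap[e] * y[e] for e in cap)       # 0.5*4 + 0.5*6 = 5.0

trials = 100_000
total = 0.0
for _ in range(trials):
    z = random.random()
    S = {v for v in d if d[v] >= z}              # the side containing s
    total += sum(c for (u, v), c in cap.items() if u in S and v not in S)

avg = total / trials
assert abs(avg - lp_value) < 0.05                # expected cut cost = LP value here
```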
We are now ready for the kill: the optimal value of the original max-flow LP, that is, the max flow (which is
a finite number because all the capacities are bounded), is equal, by the strong duality theorem, to the
optimal value of the dual LP (i.e., α∗). We just argued that α∗ is the cost of the min cut in the given network.
As such, we proved the following.

Lemma 21.5.1. The Min-Cut Max-Flow Theorem follows from the strong duality Theorem for Linear Pro-
gramming.

Remark 21.5.2. In the above, we used the following “trivial” but powerful argument. Assume you have a random
variable Z, and consider its expectation µ = E[Z]. The expectation µ is the weighted average value of the values
the random variable Z might have, and in particular, there must be a value z that might be assigned to Z (with
non-zero probability), such that z ≤ µ. Putting it differently, the weighted average of a set of numbers is bigger
(formally, no smaller) than some number in this set.
This argument is one of the standard tools in the probabilistic method – a technique to prove the existence
of entities by considering expectations and probabilities.

21.6. Solving LPs without ever getting into a loop - symbolic perturbations
21.6.1. The problem and the basic idea
Consider the following LP:
    max   z = v + Σ_{j∈N} c_j x_j,
    s.t.  x_i = b_i − Σ_{j∈N} a_ij x_j    for i = 1, . . . , n,
          x_i ≥ 0,                         ∀i = 1, . . . , n + m.

(Here B = {1, . . . , n} and N = {n + 1, . . . , n + m}.) The Simplex algorithm might get stuck in a loop of pivoting
steps if one of the constants b_i becomes zero during the execution of the algorithm. To avoid this, we are going
to add tiny infinitesimals to all the equations. Specifically, let ε > 0 be an arbitrarily small constant, and let
ε_i = ε^i. The quantities ε_1, . . . , ε_n are infinitesimals of different scales. We slightly perturb the above LP by
adding them to each equation. We get the following modified LP:
    max   z = v + Σ_{j∈N} c_j x_j,
    s.t.  x_i = ε_i + b_i − Σ_{j∈N} a_ij x_j    for i = 1, . . . , n,
          x_i ≥ 0,                               ∀i = 1, . . . , n + m.

Importantly, any feasible solution to the original LP translates into a valid solution of this LP (we made things
better by adding these symbolic constants).
The rule of the game is now that we treat ε1, . . . , εn as symbolic constants. Of course, when we do pivoting,
we need to be able to compare two numbers and decide which one is bigger. Formally, given two numbers

α = α0 + α1 ε1 + · · · + αn εn and β = β0 + β1 ε1 + · · · + βn εn, (21.3)

then α > β if and only if there is an index i such that α0 = β0, α1 = β1, . . . , αi−1 = βi−1 and αi > βi . That is,
α > β if the vector (α0, α1, . . . , αn ) is lexicographically larger than (β0, β1, . . . , βn ).
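Python tuples compare lexicographically, so this comparison rule is essentially a one-liner (an illustrative sketch; the helper sym is our own, not part of the notes):

```python
# The comparison rule of Eq. (21.3): a symbolic number
# alpha_0 + alpha_1*eps_1 + ... + alpha_n*eps_n is represented by its
# coefficient vector, and comparison is lexicographic on that vector.

def sym(*coeffs):
    """Represent alpha_0 + alpha_1*eps_1 + ... + alpha_n*eps_n as a tuple;
    eps_1 >> eps_2 >> ... are symbolic infinitesimals of decreasing scale."""
    return tuple(coeffs)

a = sym(3, 0, 1)      # 3 + eps_2
b = sym(3, 0, 0)      # 3
c = sym(3, 1, 0)      # 3 + eps_1

assert a > b                  # equal up to alpha_1, then 1 > 0 in the eps_2 slot
assert c > a                  # eps_1 dominates any multiple of eps_2
assert sym(2, 99, 99) < b     # the constant term dominates all infinitesimals
```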
Significantly, but not obviously at this stage, the simplex algorithm would never divide an ε_i by an ε_j, so
we are good to go – we can perform all the needed arithmetic operations of the Simplex using these symbolic
constants, and we claim that the constant term (which is a number of the form of Eq. (21.3)) is never
zero. This implies immediately that the Simplex algorithm always makes progress, and thus it terminates. We
still need to address two issues:
(A) How are the symbolic perturbations updated at each iteration?
(B) Why can the constants never be zero?

21.6.2. Pivoting as a Gauss elimination step
Consider the LP equations
    x_i + Σ_{j∈N} a_ij x_j = b_i,    for i ∈ B,

where B = {1, . . . , n} and N = {n + 1, . . . , n + m}. We can write these equations down in matrix form

    x1  x2 ... xn | xn+1    xn+2   ...  xj    ...  xn+m   | const
     1   0 ...  0 | a1,n+1  a1,n+2 ...  a1,j  ...  a1,n+m |  b1
     0   1 ...  0 | a2,n+1  a2,n+2 ...  a2,j  ...  a2,n+m |  b2
     .   .      . |    .       .         .           .    |   .
     0 ..1..  0   | ak,n+1  ak,n+2 ...  ak,j  ...  ak,n+m |  bk
     .   .      . |    .       .         .           .    |   .
     0   0 ...  1 | an,n+1  an,n+2 ...  an,j  ...  an,n+m |  bn

Assume that we now do a pivoting step with x_j entering the basic variables, and x_k leaving. To this end, we
multiply the kth row (i.e., the kth equation) by 1/a_k,j; this results in the kth row having 1 instead of a_k,j in
the jth column. Let this resulting row be denoted by r. Now, subtract a_i,j · r from the ith row of the matrix,
for all i ≠ k. Clearly, in the resulting rows/equations, the coefficient of x_j is zero in all rows except the kth one,
where it is 1. Note that in the matrix on the left side, all the columns stay the same, except for the kth column,
which might now contain various numbers. The final step is to exchange the kth column on the left with the jth
column on the right. And that is one pivoting step, when working on the LP using a matrix. It is very similar
to one step of Gaussian elimination on matrices, if you are familiar with that.
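The pivoting step described above (minus the final column exchange) can be sketched as a scaled row elimination (illustrative Python; the tiny example system is made up):

```python
# One pivoting step on a tableau: scale row k so the pivot entry becomes 1,
# then eliminate the pivot column from every other row.

def pivot(M, k, j):
    """Pivot the tableau M (rows = equations, last entry = constant term) on
    entry (k, j). Returns a new tableau; column j becomes the unit vector e_k."""
    n_cols = len(M[0])
    p = M[k][j]
    r = [v / p for v in M[k]]                   # scale row k so entry (k, j) is 1
    out = []
    for i, row in enumerate(M):
        if i == k:
            out.append(r)
        else:
            f = row[j]
            out.append([row[c] - f * r[c] for c in range(n_cols)])  # eliminate
    return out

# x1 + 2*x3 = 4,  x2 + 3*x3 = 9; bring x3 into the basis via row 1 (k=1, j=2):
M = [[1.0, 0.0, 2.0, 4.0],
     [0.0, 1.0, 3.0, 9.0]]
M2 = pivot(M, k=1, j=2)

assert [row[2] for row in M2] == [0.0, 1.0]     # column of x3 is now e_k
assert M2[1] == [0.0, 1/3, 1.0, 3.0]            # row k scaled by 1/3
assert M2[0] == [1.0, -2/3, 0.0, -2.0]          # row 0 minus 2 * r
```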

21.6.2.1. Back to the perturbation scheme


We now add a new matrix to the above representation, on the right side, that keeps track of the εs. Initially,
this looks as follows.
    x1  x2 ... xn | xn+1 ... xj ... xn+m | const | ε1  ε2 ... εn
1 0 ... 0 a1,n+1 . . . a1, j