Lecture Notes in Computer Science 2625
Edited by G. Goos, J. Hartmanis, and J. van Leeuwen
Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Tokyo
Ulrich Meyer Peter Sanders Jop Sibeyn (Eds.)
Algorithms for
Memory Hierarchies
Advanced Lectures
Series Editors
Gerhard Goos, Karlsruhe University, Germany
Juris Hartmanis, Cornell University, NY, USA
Jan van Leeuwen, Utrecht University, The Netherlands
Volume Editors
Ulrich Meyer
Peter Sanders
Max-Planck-Institut für Informatik
Stuhlsatzenhausweg 85, 66123 Saarbrücken, Germany
E-mail: {umeyer,sanders}@mpi-sb.mpg.de
Jop Sibeyn
Martin-Luther-Universität Halle-Wittenberg, Institut für Informatik
Von-Seckendorff-Platz 1, 06120 Halle, Germany
E-mail: [email protected]
Cataloging-in-Publication Data applied for
A catalog record for this book is available from the Library of Congress.
Bibliographic information published by Die Deutsche Bibliothek.
Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie;
detailed bibliographic data is available in the Internet at <http://dnb.ddb.de>.
CR Subject Classification (1998): F.2, E.5, E.1, E.2, D.2, D.4, C.2, G.2, H.2, I.2, I.3.5
ISSN 0302-9743
ISBN 3-540-00883-7 Springer-Verlag Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are
liable for prosecution under the German Copyright Law.
Springer-Verlag Berlin Heidelberg New York
a member of BertelsmannSpringer Science+Business Media GmbH
http://www.springer.de
© Springer-Verlag Berlin Heidelberg 2003
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Boller Mediendesign
Printed on acid-free paper SPIN: 10873015 06/3142 543210
Preface
Algorithms that process large data sets have to take into account that the
cost of memory accesses depends on where the accessed data is stored. Tradi-
tional algorithm design is based on the von Neumann model which assumes
uniform memory access costs. Actual machines increasingly deviate from this
model. While waiting for a single memory access, modern microprocessors can
execute on the order of 1000 register additions. For hard disk accesses this factor can reach
seven orders of magnitude. The 16 chapters of this volume introduce and
survey algorithmic techniques used to achieve high performance on memory
hierarchies. The focus is on methods that are interesting both from a practical
and from a theoretical point of view.
This volume is the result of a GI-Dagstuhl Research Seminar. The Ge-
sellschaft für Informatik (GI) has organized such seminars since 1997. They
can be described as “self-taught” summer schools where graduate students
in cooperation with a few more experienced researchers have an opportunity
to acquire knowledge about a current topic of computer science. The seminar
was organized as Dagstuhl Seminar 02112 from March 10, 2002 to March
14, 2002 in the International Conference and Research Center for Computer
Science at Schloss Dagstuhl.
Chapter 1 gives a more detailed motivation for the importance of al-
gorithm design for memory hierarchies and introduces the models used in
this volume. Interestingly, the simplest model variant — two levels of mem-
ory with a single processor — is sufficient for most algorithms in this book.
Chapters 1–7 represent much of the algorithmic core of external memory
algorithms and almost exclusively rely on this simple model. Among these,
Chaps. 1–3 lay the foundations by describing techniques used in more spe-
cific applications. Rasmus Pagh discusses data structures like search trees,
hash tables, and priority queues in Chap. 2. Anil Maheshwari and Norbert
Zeh explain generic algorithmic approaches in Chap. 3. Many of these tech-
niques such as time-forward processing, Euler tours, or list ranking can be
formulated in terms of graph theoretic concepts. Together with Chaps. 4 and
5 this offers a comprehensive review of external graph algorithms. In Chap. 4,
Irit Katriel and Ulrich Meyer discuss fundamental algorithms for graph
traversal, shortest paths, and spanning trees that work for many types of
graphs. Since even simple graph problems can be difficult to solve in external
memory, it makes sense to look for better algorithms for frequently occurring
special types of graphs. In Chap. 5, Laura Toma and Norbert Zeh present a
number of astonishing techniques that work well for planar graphs and graphs
with bounded tree-width.

[Figure: overview of the volume's structure — models; basics (data structures,
algorithms, graph techniques); graphs and special graphs; geometry; text
indexes; caches and cache-oblivious algorithms; numerics; AI; applications
(storage networks, file systems, databases); parallelism (parallel models,
parallel sorting); chapters of mainly tutorial character are marked.]
In Chap. 6 Christian Breimann and Jan Vahrenhold give a comprehensive
overview of algorithms and data structures handling geometric objects like
points and lines — an area that is at least as rich as graph algorithms. A
third area, with again quite different algorithmic techniques, is string
problems, discussed by Juha Kärkkäinen and S. Srinivasa Rao in Chap. 7.
Chapters 8–10 then turn to more detailed models with particular empha-
sis on the complications introduced by hardware caches. Beyond this common
motivation, these chapters are quite diverse. Naila Rahman uses sorting as an
example for these issues in Chap. 8 and puts particular emphasis on the of-
ten neglected issue of TLB misses. Piyush Kumar introduces cache-oblivious
algorithms in Chap. 9 that promise to grasp multilevel hierarchies within a
very simple model. In Chap. 10, Markus Kowarschik and Christian Weiß give a
practical introduction to cache-efficient programming, using numerical
algorithms as an example. Numerical applications are particularly important
because they allow significant instruction-level parallelism, so that slow
memory accesses can dramatically slow down processing.
Stefan Edelkamp introduces an application area of very different char-
acter in Chap. 11. In artificial intelligence, search programs have to handle
huge state spaces that require sophisticated techniques for representing and
traversing them.
Chapters 12–14 give a system-oriented view of advanced memory hierar-
chies. On the lowest level we have storage networks connecting a large num-
ber of inhomogeneous disks. Kay Salzwedel discusses this area with particular
emphasis on the aspect of inhomogeneity. File systems give a more abstract
view of these devices on the operating system level. Florin Isaila explains the
organization of modern file systems in Chap. 13. An even higher level view is
offered by relational database systems. Josep Larriba-Pey explains their or-
ganization in Chap. 14. Both in file systems and databases, basic algorithmic
techniques like sorting and search trees turn out to be relevant.
Finally, Chaps. 15 and 16 give a glimpse of memory hierarchies with
multiple processors. Massimo Coppola and Martin Schmollinger introduce
abstract and concrete programming models like BSP and MPI in Chap. 15.
Dani Jiménez-González, Josep-L. Larriba-Pey, and Juan J. Navarro present a
concrete case study of sorting algorithms on shared memory machines in
Chap. 16. They study programming techniques that avoid pitfalls like true and
false sharing of cache contents.
Most chapters in this volume are partly tutorial in character and partly
denser overviews. At a minimum, Chaps. 1, 2, 3, 4, 9, 10, 14, and 16
are tutorial chapters suitable for beginning graduate-level students. They are
sufficiently self-contained to be used for the core of a course on external mem-
ory algorithms. Augmented with the other chapters and additional papers it
should be possible to shape various advanced courses. Chapters 1–3 lay the
basis for the remaining chapters that are largely independent.
We are indebted to many people and institutions. We name a few in al-
phabetical order. Ulrik Brandes helped with sources from a tutorial volume
on graph drawing that was our model in several aspects. The International
Conference and Research Center for Computer Science in Dagstuhl provided
its affordable conference facilities and its unique atmosphere. Springer-Verlag,
and in particular Alfred Hofmann, made it possible to smoothly publish the
volume in the LNCS series. Kurt Mehlhorn’s group at MPI Informatik pro-
vided funding for several (also external) participants. Dorothea Wagner came
up with the idea for the seminar and advised us in many ways. This volume
was also partially supported by the Future and Emerging Technologies pro-
gramme of the EU under contract number IST-1999-14186 (ALCOM-FT).
January 2003 Ulrich Meyer
Peter Sanders
Jop Sibeyn
Table of Contents
1. Memory Hierarchies — Models and Lower Bounds
Peter Sanders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Why Memory Hierarchies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Current Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Issues in External Memory Algorithm Design . . . . . . . . . . . . . . 9
1.5 Lower Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2. Basic External Memory Data Structures
Rasmus Pagh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1 Elementary Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Dictionaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 B-trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Hashing Based Dictionaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5 Dynamization Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3. A Survey of Techniques for Designing I/O-Efficient
Algorithms
Anil Maheshwari and Norbert Zeh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2 Basic Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 Simulation of Parallel Algorithms in External Memory . . . . . . 44
3.4 Time-Forward Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.5 Greedy Graph Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.6 List Ranking and the Euler Tour Technique . . . . . . . . . . . . . . . . 50
3.7 Graph Blocking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.8 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4. Elementary Graph Algorithms in External Memory
Irit Katriel and Ulrich Meyer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2 Graph-Traversal Problems: BFS, DFS, SSSP . . . . . . . . . . . . . . . 63
4.3 Undirected Breadth-First Search . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.4 I/O-Efficient Tournament Trees . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.5 Undirected SSSP with Tournament Trees . . . . . . . . . . . . . . . . . . 73
4.6 Graph-Traversal in Directed Graphs . . . . . . . . . . . . . . . . . . . . . . 74
4.7 Conclusions and Open Problems for Graph Traversal . . . . . . . . 76
4.8 Graph Connectivity: Undirected CC, BCC, and MSF . . . . . . . 77
4.9 Connected Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.10 Minimum Spanning Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.11 Randomized CC and MSF . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.12 Biconnected Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.13 Conclusion for Graph Connectivity . . . . . . . . . . . . . . . . . . . . . . . 84
5. I/O-Efficient Algorithms for Sparse Graphs
Laura Toma and Norbert Zeh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.2 Definitions and Graph Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.3 Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.4 Connectivity Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.5 Breadth-First Search and Single Source Shortest Paths . . . . . . 93
5.6 Depth-First Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.7 Graph Partitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.8 Gathering Structural Information . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.9 Conclusions and Open Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6. External Memory Computational Geometry Revisited
Christian Breimann and Jan Vahrenhold . . . . . . . . . . . . . . . . . . . . . . . 110
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.2 General Methods for Solving Geometric Problems . . . . . . . . . . 112
6.3 Problems Involving Sets of Points . . . . . . . . . . . . . . . . . . . . . . . . 119
6.4 Problems Involving Sets of Line Segments . . . . . . . . . . . . . . . . . 131
6.5 Problems Involving Set of Polygonal Objects . . . . . . . . . . . . . . . 144
6.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
7. Full-Text Indexes in External Memory
Juha Kärkkäinen and S. Srinivasa Rao . . . . . . . . . . . . . . . . . . . . . . . . . . 149
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
7.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
7.3 Basic Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.4 I/O-Efficient Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
7.5 External Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
7.6 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
8. Algorithms for Hardware Caches and TLB
Naila Rahman . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
8.2 Caches and TLB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
8.3 Memory Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
8.4 Algorithms for Internal Memory . . . . . . . . . . . . . . . . . . . . . . . . . . 181
8.5 Cache Misses and Power Consumption . . . . . . . . . . . . . . . . . . . . 185
8.6 Exploiting Other Memory Models: Advantages and
Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
8.7 Sorting Integers in Internal Memory . . . . . . . . . . . . . . . . . . . . . . 189
9. Cache Oblivious Algorithms
Piyush Kumar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
9.2 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
9.3 Algorithm Design Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
9.4 Matrix Transposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
9.5 Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
9.6 Searching Using Van Emde Boas Layout . . . . . . . . . . . . . . . . . . . 203
9.7 Sorting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
9.8 Is the Model an Oversimplification? . . . . . . . . . . . . . . . . . . . . . . . 209
9.9 Other Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
9.10 Open Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
9.11 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
10. An Overview of Cache Optimization Techniques and
Cache-Aware Numerical Algorithms
Markus Kowarschik and Christian Weiß . . . . . . . . . . . . . . . . . . . . . . . . 213
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
10.2 Architecture and Performance Evaluation of Caches . . . . . . . . . 214
10.3 Basic Techniques for Improving Cache Efficiency . . . . . . . . . . . 217
10.4 Cache-Aware Algorithms of Numerical Linear Algebra . . . . . . 225
10.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
11. Memory Limitations in Artificial Intelligence
Stefan Edelkamp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
11.2 Hierarchical Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
11.3 Single-Agent Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
11.4 Action Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
11.5 Game Playing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
11.6 Other AI Areas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
11.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
12. Algorithmic Approaches for Storage Networks
Kay A. Salzwedel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
12.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
12.3 Space and Access Balance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
12.4 Availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
12.5 Heterogeneity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
12.6 Adaptivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
12.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
13. An Overview of File System Architectures
Florin Isaila . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
13.2 File Access Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
13.3 File System Duties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
13.4 Distributed File Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
13.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
14. Exploitation of the Memory Hierarchy in Relational
DBMSs
Josep-L. Larriba-Pey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
14.2 What to Expect and What Is Assumed . . . . . . . . . . . . . . . . . . . 291
14.3 DBMS Engine Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
14.4 Evidences of Locality in Database Workloads . . . . . . . . . . . . . . 297
14.5 Basic Techniques for Locality Exploitation . . . . . . . . . . . . . . . . . 298
14.6 Exploitation of Locality by the Executor . . . . . . . . . . . . . . . . . . 300
14.7 Access Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
14.8 Exploitation of Locality by the Buffer Pool Manager . . . . . . . 311
14.9 Hardware Related Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
14.10 Compilation for Locality Exploitation . . . . . . . . . . . . . . . . . . . . 318
14.11 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
15. Hierarchical Models and Software Tools for Parallel
Programming
Massimo Coppola and Martin Schmollinger . . . . . . . . . . . . . . . . . . . . 320
15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
15.2 Architectural Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
15.3 Parallel Computational Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
15.4 Parallel Bridging Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
15.5 Software Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
15.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
16. Case Study: Memory Conscious Parallel Sorting
Dani Jiménez-González, Josep-L. Larriba-Pey, and
Juan J. Navarro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
16.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
16.2 Architectural Aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
16.3 Sequential and Straight Forward Radix Sort Algorithms . . . . . 361
16.4 Memory Conscious Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
16.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
16.6 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
List of Contributors
Editors
Ulrich Meyer
Max-Planck Institut für Informatik
Stuhlsatzenhausweg 85
66123 Saarbrücken, Germany
[email protected]

Peter Sanders
Max-Planck Institut für Informatik
Stuhlsatzenhausweg 85
66123 Saarbrücken, Germany
[email protected]

Jop F. Sibeyn
Martin-Luther Universität Halle-Wittenberg
Institut für Informatik
Von-Seckendorff-Platz 1
06120 Halle, Germany
[email protected]

Authors

Christian Breimann
Westfälische Wilhelms-Universität
Institut für Informatik
Einsteinstr. 62
48149 Münster, Germany
[email protected]

Stefan Edelkamp
Albert-Ludwigs-Universität Freiburg
Institut für Informatik
Georges-Köhler-Allee, Gebäude 51
79110 Freiburg, Germany
[email protected]

Massimo Coppola
University of Pisa
Department of Computer Science
Via F. Buonarroti 2
56127 Pisa, Italy
[email protected]

Florin Isaila
University of Karlsruhe
Department of Computer Science
PO-Box 6980
76128 Karlsruhe, Germany
[email protected]

Dani Jiménez-González
Universitat Politècnica de Catalunya
Computer Architecture Department
Jordi Girona 1-3, Campus Nord-UPC
E-08034 Barcelona, Spain
[email protected]

Anil Maheshwari
Carleton University
School of Computer Science
1125 Colonel By Drive
Ottawa, Ontario, K1S 5B6, Canada
[email protected]

Irit Katriel
Max-Planck Institut für Informatik
Stuhlsatzenhausweg 85
66123 Saarbrücken, Germany
[email protected]

Ulrich Meyer
Max-Planck Institut für Informatik
Stuhlsatzenhausweg 85
66123 Saarbrücken, Germany
[email protected]

Juha Kärkkäinen
Max-Planck Institut für Informatik
Stuhlsatzenhausweg 85
66123 Saarbrücken, Germany
[email protected]

Juan J. Navarro
Universitat Politècnica de Catalunya
Computer Architecture Department
Jordi Girona 1-3, Campus Nord-UPC
E-08034 Barcelona, Spain
[email protected]

Markus Kowarschik
Friedrich-Alexander-Universität Erlangen-Nürnberg
Lehrstuhl für Informatik 10
Cauerstraße 6
91058 Erlangen, Germany
[email protected]

Rasmus Pagh
The IT University of Copenhagen
Glentevej 67
2400 København NV, Denmark
[email protected]

Piyush Kumar
State University of New York at Stony Brook
Department of Computer Science
Stony Brook, NY 11790, USA
[email protected]

Naila Rahman
University of Leicester
Department of Mathematics and Computer Science
University Road
Leicester, LE1 7RH, U.K.
[email protected]

Josep-L. Larriba-Pey
Universitat Politècnica de Catalunya
Computer Architecture Department
Jordi Girona 1-3, Campus Nord-UPC
E-08034 Barcelona, Spain
[email protected]

S. Srinivasa Rao
University of Waterloo
School of Computer Science
200 University Avenue West
Waterloo, Ontario, N2L 3G1, Canada
[email protected]

Kay A. Salzwedel
Universität Paderborn
Heinz Nixdorf Institut
Fürstenallee 11
33102 Paderborn, Germany
[email protected]

Jan Vahrenhold
Westfälische Wilhelms-Universität
Institut für Informatik
Einsteinstr. 62
48149 Münster, Germany
[email protected]

Peter Sanders
Max-Planck Institut für Informatik
Stuhlsatzenhausweg 85
66123 Saarbrücken, Germany
[email protected]

Christian Weiß
Technische Universität München
Lehrstuhl für Rechnertechnik und Rechnerorganisation
Boltzmannstr. 3
85748 München, Germany
[email protected]

Martin Schmollinger
Universität Tübingen
Wilhelm-Schickard Institut für Informatik
Sand 14
72076 Tübingen, Germany
[email protected]

Norbert Zeh
Duke University
Department of Computer Science
Durham, NC 27708, USA
[email protected]

Laura Toma
Duke University
Department of Computer Science
Durham, NC 27708, USA
[email protected]
1. Memory Hierarchies —
Models and Lower Bounds
Peter Sanders∗
The purpose of this introductory chapter is twofold. On the one hand, it serves
the rather prosaic purpose of introducing the basic models and notations used
in the subsequent chapters. On the other hand, it explains why these simple
abstract models can be used to develop better algorithms for complex real
world hardware.
Section 1.1 starts with a basic motivation for memory hierarchies and
Section 1.2 gives a glimpse of their current and future technological realiza-
tions. More theoretically inclined readers can skip or skim this section and
directly proceed to the introduction of the much simpler abstract models in
Section 1.3. Then we have all the terminology in place to explain the guiding
principles behind algorithm design for memory hierarchies in Section 1.4. A
further issue permeating most external memory algorithms is the existence of
fundamental lower bounds on I/O complexity described in Section 1.5. Less
theoretically inclined readers can skip the proofs but might want to remember
these bounds because they show up again and again in later chapters.
Parallelism is another important approach to high performance comput-
ing that has many interactions with memory hierarchy issues. We describe
parallelism issues in subsections that can be skipped by readers only inter-
ested in sequential memory hierarchies.
1.1 Why Memory Hierarchies
There is a wide spectrum of computer applications that can make use of
arbitrarily large amounts of memory. For example, consider geographic infor-
mation systems. NASA measures the data volumes from satellite images in
petabytes ($10^{15}$ bytes). Similar figures are given by climate research centers
and particle physicists [555].
Although it is unlikely that all this information will be needed in a single
application, we can easily arrive at huge data sets. For example, consider a
map of the world that associates 32 bits with each square meter of a continent
— something technologically quite feasible with modern satellite imagery. The
continents cover roughly $1.5 \cdot 10^{14}$ square meters, so at four bytes
each we would get a data set of about 600 terabytes.
Other examples of huge data sets are data warehouses of large compa-
nies that keep track of every single transaction, digital libraries for books,
images, and movies (a single image frame of high quality movie takes several
* Partially supported by the Future and Emerging Technologies programme of the
EU under contract number IST-1999-14186 (ALCOM-FT).
megabytes), or large scale numerical simulations. Even if the input and output
of an application are small, it might be necessary to store huge intermediate
data structures. For example, some of the state space search algorithms in
Chapter 11 are of this type.
How should a machine for processing such large inputs look? Data should
be cheap to store but we also want fast processing. Unfortunately, there are
fundamental reasons why we cannot get memory that is at the same time
cheap, compact, and fast. For example, no signal can propagate faster than
light. Hence, given a storage technology and a desired access latency, there is
only a finite amount of data reachable within this time limit. Furthermore, in
a cheap and compact storage technology there is no room for wires reaching
every single memory cell. It is more economical to use a small number of
devices that can be moved to access a given bit.
There are several approaches to escape this so-called memory wall prob-
lem. The simplest and most widely used compromise is a memory hierarchy.
There are several categories of memory in a computer ranging from small and
fast to large, cheap, and slow. Even in a memory hierarchy, we can process
huge data sets efficiently. The reason is that although access latencies to the
huge data sets are large, we can still achieve large bandwidths by accessing
many close-by bits together and by using several memory units in parallel.
Both approaches can be modeled using the same abstract view: Access to
large blocks of memory is almost as fast as access to a single bit. The algo-
rithmic challenge following from this principle is to design algorithms that
perform well on systems with blocked memory access. This is the main sub-
ject of this volume.
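To see what blocked access means for a program, here is a small C++ experiment (our illustration, not from the text): it sums the same array once sequentially and once with a large stride. Both runs perform exactly the same additions, but on typical hardware the strided run is several times slower because almost every access transfers a whole memory block of which only one word is used.

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    const std::size_t n = std::size_t(1) << 24;  // 16M words, much larger than any cache
    std::vector<long> a(n, 1);

    // Sum all n words, visiting them with the given stride; every residue
    // class is covered, so both calls below execute the same additions.
    auto timed_sum = [&](std::size_t step) {
        auto start = std::chrono::steady_clock::now();
        long sum = 0;
        for (std::size_t off = 0; off < step; ++off)
            for (std::size_t i = off; i < n; i += step) sum += a[i];
        std::chrono::duration<double> d = std::chrono::steady_clock::now() - start;
        std::cout << "stride " << step << ": sum = " << sum
                  << ", time = " << d.count() << " s\n";
    };

    timed_sum(1);     // blocked access: each fetched block serves many accesses
    timed_sum(4096);  // strided access: one block transfer per useful word
}
```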
1.2 Current Technology
Although we view memory hierarchies as something fundamental, it is in-
structive to look at the way memory hierarchies are currently designed and
how they are expected to change in the near future. More details and ex-
planations can be found in the still reasonably up to date textbook [392]. A
valuable and up-to-date introductory source is the web page on PC Technol-
ogy http://www.pctechguide.com/.
Currently, a high performance microprocessor has a file of registers that
have multiple ports so that several accesses can be made in parallel. For
example, a superscalar machine that executes up to four instructions per
clock cycle, each of which addresses three registers, must support twelve
parallel accesses.
Since multiple ports require too much chip area per bit, the first level (L1)
cache supports only one or two accesses per clock. Each such access already
incurs a delay of a few clock cycles since additional stages of the instruction
processing pipelines have to be traversed. The L1 cache is usually only a few
kilobytes in size because a larger area would incur longer connections and
hence even larger access latencies [399]. Often there are separate L1 caches
for instructions and data.
The second level (L2) cache is on the same chip as the first level cache but
it has quite different properties. The L2 cache is as large as the technology
allows because applications that fit most of their data into this cache can
execute very fast. The L2 cache has access latencies around ten clock cycles.
Communication between L1 and L2 cache uses block sizes of 16–32 bytes. For
accessing off-chip data, larger blocks are used. For example, the Pentium 4
uses 128 byte blocks [399].
Some processors have a third level (L3) cache that is on a separate set of
chips. This cache is made out of fast static¹ RAM cells. The L3 cache can
be very large in principle, but this is not always cost effective because static
RAMs are rather expensive.
The main memory is made out of high density cheap dynamic RAM
cells. Since the access speeds of dynamic RAMs have lagged behind processor
speeds, dynamic RAMs have developed into devices optimized for block ac-
cess. For example, RAMBUS RDRAM² chips allow blocks of up to 16 bytes
to be accessed in only twice the time to access a single byte.
The programmer is not required to know about the details of the hierarchy
between caches and main memory. The hardware cuts the main memory into
blocks of fixed size and automatically maps a subset of the memory blocks
to L3 cache. Furthermore, it automatically maps a subset of the blocks in L3
cache to L2 cache and from L2 cache to L1 cache. Although this automatic
cache administration is convenient and often works well, one is in for un-
pleasant surprises. In Chapter 8 we will see that sometimes a careful manual
mapping of data to the memory hierarchy would work much better.
The backbone of current data storage are magnetic hard disks because
they offer cheap non-volatile memory [643]. In recent years, extremely high
densities have been achieved for magnetic surfaces that allow several giga-
bytes to be stored on the area of a postage stamp. The data is accessed by
tiny magnetic devices that hover as low as 20 nm over the surface of the
rotating disk. It takes a long time to move the access head to a particular track
of the disk and to wait until the disk rotates into the correct position. With
up to 10 ms, disk access can be $10^7$ times slower than an access to a register.
However, once the head starts reading or writing, data can be transferred at
a rate of about 50 megabytes per second. Hence, accessing hundreds of KB
takes only about twice as long as accessing a single byte. Clearly, it makes
sense to process data in large chunks.
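To make this concrete with the figures just given (a rough calculation of ours, ignoring all other overheads), the time to access $L$ contiguous bytes is approximately
$$t(L) \approx 10\,\mathrm{ms} + \frac{L}{50\,\mathrm{MB/s}},$$
so a single byte costs about 10 ms while a contiguous 500 KB chunk costs about 20 ms — half a million times more data for only twice the time.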
Hard disks are also used as a way to virtually enlarge the main mem-
ory. Logical blocks that are currently not in use are swapped to disk. This
mechanism is partially supported by the processor hardware that is able to
¹ Static RAM needs six transistors per bit, which makes it more area-consuming
but faster than dynamic RAM, which needs only one transistor per bit.
² http://www.rambus.com
automatically translate between logical memory addresses and physical mem-
ory addresses. This translation uses yet another small cache, the translation
lookaside buffer (TLB).
There is a final level of memory hierarchy used for backups and archiving
of data. Magnetic tapes and optical disks allow even cheaper storage of data
but have a very high access latency ranging from seconds to minutes because
the media have to be retrieved from a shelf and mounted on some access
device.
Current and Future Developments
There are too many possible developments to explain or even perceive all of
them in detail but a few basic trends should be noted. The memory hierarchy
might become even deeper. Third-level caches will become more common;
Intel has even integrated one on the Itanium 2 processor. In such a system,
an off-chip 4th level cache makes sense. There is also a growing gap between
the access latencies and capacities of disks and main memory. Therefore,
magnetic storage devices with smaller capacity but also lower access latency
have been proposed [669].
While storage density in CMOS-RAMs and magnetic disks will keep in-
creasing for quite some time, it is conceivable that different technologies will
get their chance in a longer time frame. There are some ideas available that
would allow memory cells consisting of single molecules [780]. Furthermore,
even with current densities, astronomically large amounts of data could be
stored using three-dimensional storage devices. The main difficulty is how to
write and read such memories. One approach uses holographic images stor-
ing large blocks of data in small three-dimensional regions of a transparent
material [716].
Regardless of the technology, it seems likely that block-wise access and
the use of parallelism will remain necessary to achieve high performance
processing of large volumes of data.
Parallelism
A more radical change in the model is explicit parallel processing. Although
this idea is not so new, there are several reasons why it might have increased
impact in the near future. Microprocessors like the Intel Xeon first delivered
in 2002 have multiple register sets and are able to execute a corresponding
number of threads of activity in parallel. These threads share the same ex-
ecution pipeline. Their accumulated performance can be significantly higher
than the performance of a single thread with exclusive access to the pro-
cessing resources. One main reason is that while one thread is waiting for a
memory access to finish, another thread can use the processor. Parallelism
spreads in many other respects. Several processors on the same chip can share
a main memory and a second level cache. The IBM Power 4 processor already
implements this technology. Several processors on different chips can share
main memory. Several processor boards can share the same network of disks.
Servers usually have many disk drives. In such systems, it becomes more and
more important that memory devices on all levels of the memory hierarchy
can work on multiple memory accesses in parallel.
On parallel machines, some levels of the memory hierarchy may be shared
whereas others are distributed between the processors. Local caches may hold
copies of shared or remote data. Thus, a read access to shared data may be
as fast as a local access. However, writing shared data invalidates all the
copies that are not in the cache of the writing processor. This can cause
severe overhead for sending the invalidations and for reloading the data at
subsequent remote accesses.
1.3 Modeling
We have seen that real memory hierarchies are very complex. We have mul-
tiple levels, all with their own idiosyncrasies. Hardware caches have replace-
ment strategies that vary between simplistic and strange [294], disks have
position dependent access delays, etc. It might seem that the best models are
those that are as accurate as possible. However, for algorithm design, this
leads the wrong way. Complicated models make algorithms difficult to design
and analyze. Even if we overcome these difficulties, it would be very difficult
to interpret the results because complicated models have a lot of parameters
that vary from machine to machine.
Attractive models for algorithm design are very simple, so that it is easy to
develop algorithms. They have few parameters so that it is easy to compare
the performance of algorithms. The main issue in model design is to find
simple models that grasp the essence of the real situation so that algorithms
that are good in the model are also good in reality.
In this volume, we build on the most widely used nonhierarchical model.
In the random access machine (RAM) model or von Neumann model [579],
we have a “sufficiently” large uniform memory storing words of size O(log n)
bits where n is the size of our input. Accessing any word in memory takes con-
stant time. Arithmetics and bitwise operations with words can be performed
in constant time. For numerical and geometric algorithms, it is sometimes
also assumed that words can represent real numbers accurately. Storage con-
sumption is measured in words if not otherwise mentioned.
Most chapters of this volume use a minimalistic extension that we will
simply call the external memory model. We use the notation introduced by
Aggarwal, Vitter, and Shriver [17, 755]. Processing works almost as in the
RAM model, except that there are only M words of internal memory that
can be accessed quickly. The remaining memory can only be accessed using
I/Os that move B contiguous words between internal and external memory.
[Fig. 1.1. The external memory model: a CPU with a fast internal memory of M
words, connected to a large external memory via transfers of blocks of B
contiguous words.]
Figure 1.1 depicts this arrangement. To analyze an external memory algo-
rithm, we count the number of I/Os needed in addition to the time that
would be needed on a RAM machine.
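As a minimal sketch of how this accounting works (a toy model of ours, not part of the formal definition), the following C++ fragment wraps external memory in a class that only permits block transfers; scanning $N$ words then visibly costs $\lceil N/B \rceil$ I/Os:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Toy external memory: an array of words that can only be accessed in
// blocks of B contiguous words; every block transfer is counted.
struct ExternalMemory {
    std::size_t B;             // block size in words
    std::vector<long> words;   // the "disk"
    std::size_t io_count = 0;  // number of block I/Os performed

    std::vector<long> read_block(std::size_t i) {  // words i*B .. i*B+B-1
        ++io_count;
        auto first = words.begin() + i * B;
        return std::vector<long>(first, first + B);
    }
};

// Scanning n words block by block costs ceil(n/B) I/Os: scan(N) = Theta(N/B).
long scan_sum(ExternalMemory& em, std::size_t n) {
    long sum = 0;
    for (std::size_t i = 0; i * em.B < n; ++i)
        for (long w : em.read_block(i)) sum += w;
    return sum;
}

int main() {
    ExternalMemory em{4, std::vector<long>(32, 1)};      // N = 32, B = 4
    std::cout << "sum = " << scan_sum(em, 32)
              << " using " << em.io_count << " I/Os\n";  // 8 = N/B I/Os
}
```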
Why is such a simple model adequate to describe something as complex as
memory hierarchies? The easiest justification would be to lean on authority.
Hundreds of papers using this model have been published, many of them in
top conferences and journals. Many external memory algorithms developed
are successfully used in practice. Vitter [754] gives an extensive overview.
But why is this model so successful? Although the word “I/O” suggests that
external memory should be identified with disk memory, we are free to choose
any two levels of the memory hierarchy for internal and external memory in
the model. Inaccuracies of the model are usually limited by reasonable con-
stant factors. This claim needs further explanation. The main problem with
hardware caches is that they use a fixed simplistic strategy for deciding which
blocks are kept whereas the external memory model gives the programmer
full control over the content of internal memory. Although this difference can
have devastating effects, it rarely happens in practice. Mehlhorn and Sanders
[543] give an explanation of this effect for a large class of cache access pat-
terns. Sen and Chatterjee [685] and Frigo et al. [321] have observed that in
principle we can even circumvent hardware replacement schemes and take
explicit control of cache content.
Hard disks are even more complicated than caches [643] but again inac-
curacies of the external memory model are not as big as one might think:
Disks have their own local caches. But these are so small that for algorithms
that process really large data sets they do not make a big difference. Roughly
speaking, the disk access time consists of a latency needed to move the disk
head to the appropriate position and a transfer time that is proportional
to the amount of data transmitted. We are more or less free to choose this
amount of data and hence it is not accurate to only count the number of
accesses. However, if we fix the block size so that the transfer time is about
the same as the latency, we only make a small error. Let us explain this for
the (oversimplified) case that time $t_0 + B$ is needed to access $B$ words
of data. Then a good choice of the block size is $B = t_0$. When we access
less data we are at most a factor two off by accessing an entire block of size
$B$. When we access $L > B$ words, we are at most a factor two off by counting
$\lceil L/B \rceil$ block I/Os.
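Spelling out these factor-two claims (a quick check of the argument, using $B = t_0$): for $L \le B$,
$$\frac{t_0 + B}{t_0 + L} \le \frac{2t_0}{t_0} = 2,$$
and for $L > B$, using $\lceil L/B \rceil \le L/B + 1$,
$$\lceil L/B \rceil\,(t_0 + B) \le \left(\frac{L}{B} + 1\right) 2B = 2(L + B) = 2(t_0 + L).$$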
In reality, the access latency depends on the current position of the disk
mechanism and on the position of the block to be accessed on the disk.
Although exploiting this effect can make a big difference, programs that op-
timize access latencies are rare since the details depend on the actual disk
used and are usually not published by the disk vendors. If other applications
or the operating system make additional unpredictable accesses to the same
disk, even sophisticated optimizations can be in vain. In summary, by picking
an appropriate block size, we can model the most important aspects of disk
drives.
Parallelism
Although we mostly use the sequential variant of the external memory model,
it also has an option to express parallelism. External memory is partitioned
into D parts (e.g. disks) so that in each I/O step, one block can be accessed
on each of the parts.
With respect to parallel disks, the model of Vitter and Shriver [755] de-
viates from an earlier model by Aggarwal and Vitter [17] where D arbitrary
blocks can be accessed in parallel. A hardware realization could have D read-
ing/writing devices that access a single disk or a D-ported memory. This
model is more powerful because algorithms need not care about the map-
ping of data to disks. However, there are efficient (randomized) algorithms
for emulating the Aggarwal-Vitter model on the Vitter-Shriver model [656].
Hence, one approach to developing parallel disk external memory algorithms
is to start with an algorithm for the Aggarwal-Vitter model and then add an
appropriate load balancing algorithm (also called declustering).
Vitter and Shriver also make provisions for parallel processing. There are
P identical processors that can work in parallel. Each has fast memory M/P
and is equipped with D/P disks. In the external memory model there are
no additional parameters expressing the communication capabilities of the
processors. Although this is an oversimplification, this is already enough to
distinguish many algorithms with respect to their ability to be executed on
parallel machines. The model seems suitable for parallel machines with shared
memory.
For discussing parallel external memory on machines with distributed
memory we need a model for communication cost. The BSP model [742] that
is widely accepted for parallel (internal) processing fits well here: The P pro-
cessors work in supersteps. During a superstep, the processors can perform
local communications and post messages to other processors to the commu-
nication subsystem. At the end of a superstep, all processors synchronize and
exchange all the messages that have been posted during the superstep. This
synchronous communication takes time $\ell + gh$ where $\ell$ is the latency,
$g$ the gap, and $h$ the maximum number of words a processor sends or receives
in this communication phase. The parameter $\ell$ models the overhead for
synchronizing the processors and the latency of messages traveling through the
network. If we assume our unit of time to be the time needed to execute one
instruction, the parameter $g$ is the ratio between communication speed and
computation speed.
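In this notation, a BSP algorithm that executes $S$ supersteps with local work $w_i$ and communication volume $h_i$ in superstep $i$ thus runs in total time
$$T = \sum_{i=1}^{S} \left(w_i + \ell + g\,h_i\right).$$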
1.3.1 More Models
In Chapter 8 we will see more refined models for the fastest levels of the
memory hierarchy, including replacement strategies used by the hardware
and the role of the TLB. Chapter 10 contributes additional practical examples
from numeric computing. Chapter 15 will explain parallel models in more
detail. In particular, we will see models that take multiple levels of hierarchy
into account.
There are also alternative models for the simple sequential memory hier-
archy. For example, instead of counting block I/Os with respect to a block
size $B$, we could allow variable block sizes and count the number of I/Os
$k$ and the total I/O volume $v$. The total I/O cost could then be accounted
as $\ell_{I/O}\,k + g_{I/O}\,v$ where — in analogy to the BSP model —
$\ell_{I/O}$ stands for the I/O latency and $g_{I/O}$ for the ratio between
I/O speed and computation speed. This model is largely equivalent to the
block-based model but it might
be more elegant when used together with the BSP model and it is more ad-
equate to explain differences between algorithms with regular and irregular
access patterns [227].
Another interesting variant is the cache oblivious model discussed in
Chapter 9. This model is identical to the external memory model except
that the algorithm is not told the values of B and M . The consequence of
this seemingly innocent variant is that an I/O efficient cache oblivious al-
gorithm works well not only on any machine but also on all levels of the
memory hierarchy at the same time. Cache oblivious algorithms can be very
simple: for example, we do not need to know B and M to scan an array. But even
cache oblivious sorting is quite difficult.
Finally, there are interesting approaches to eliminate memory hierarchies.
Blocked access is only one way to hide access latency. Another approach is
pipelining where many independent accesses are executed in parallel. This
approach is more powerful but also more difficult to support in hardware.
Vector computers such as the NEC SX-6 support pipelined memory access
even to nonadjacent cells at full memory bandwidth. Several experimental
machines [2, 38] use massive pipelined memory access by the hardware to
run many parallel threads on a single processor. While one thread waits for
a memory access, the other threads can do useful work. Modern mainstream
processors also support pipelined memory access to a certain extent [399].
1.4 Issues in External Memory Algorithm Design
Before we look at particular algorithms in the rest of this volume, let us
first discuss the goals we should achieve by an external memory algorithm.
Ideally, the user should not notice the difference between external memory
and internal memory at all, i.e., the program should run as fast as if all the
memory were internal memory. The following principles help:
Internal efficiency: The internal work done by the algorithm should be com-
parable to the best internal memory algorithms.
Spatial locality: When a block is accessed, it should contain as much useful
data as possible.
Temporal locality: Once data is in the internal memory, as much useful work
as possible should be done on it before it is written back to external
memory.
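As a small illustration of the last two principles (our example, with an arbitrary tile size), consider transposing a large row-major matrix: the naive loop streams through one array but jumps $n$ words between consecutive accesses to the other, whereas a tiled version fully uses every fetched block before it is evicted.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Naive transpose of an n x n row-major matrix: writes to 'out' are n words
// apart, so almost every write touches a different memory block
// (poor spatial locality).
void transpose_naive(const std::vector<double>& in, std::vector<double>& out,
                     std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            out[j * n + i] = in[i * n + j];
}

// Tiled transpose: once a T x T tile of 'in' and 'out' is in fast memory,
// all its elements are used before eviction (temporal locality), and each
// fetched block contributes up to T useful words (spatial locality).
void transpose_tiled(const std::vector<double>& in, std::vector<double>& out,
                     std::size_t n, std::size_t T = 64) {
    for (std::size_t ii = 0; ii < n; ii += T)
        for (std::size_t jj = 0; jj < n; jj += T)
            for (std::size_t i = ii; i < std::min(ii + T, n); ++i)
                for (std::size_t j = jj; j < std::min(jj + T, n); ++j)
                    out[j * n + i] = in[i * n + j];
}
```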
Which of these criteria is most important depends a lot on the applica-
tion and on the hardware used. As usual in computer science, the overall
performance is mostly determined by the weakest link. Let us consider a pro-
totypical scenario. Assume we have a good internal memory algorithm for
some application. Now it turns out that we want to run it on much larger
inputs and internal memory will not suffice any more. The first try could be
to ignore this problem and see how the virtual memory capability of the op-
erating system deals with it. When this works, we are exceptionally lucky. If
we see very bad performance, this usually means that the existing algorithm
has poor locality. We may then apply the algorithmic techniques developed
in this volume to improve locality.
Several outcomes are possible. It may be that despite our effort, locality
remains the limiting factor. When discussing further improvements we will
then focus on locality and might even accept an increase of internal work.
But we should keep in mind that many algorithms do some useful work for
every word accessed, i.e., locality is quite good. If the application nevertheless
remains I/O-bound, this means that the I/O bandwidth of our system is low.
This is a common observation when researchers run their external memory
algorithms on workstations with a single disk and I/O interfaces not built for
high-performance I/O. However, we should expect that serious applications of
external memory algorithms will run on hardware and software built for high
I/O performance. Let us consider a machine recently configured by Roman
Dementiev and the author as an example. The parts for this system cost
about 3000 Euro in July 2002, i.e., the price is in the range of an ordinary
workstation. The STREAM³ benchmark achieves a main memory bandwidth
of 1445 MB/s on one of two 2.0 GHz Intel Xeon processors. Using eight disks
and four IDE controllers, we achieve an I/O bandwidth of up to 375 MB/s,
i.e., the bandwidth gap between main memory and disks is not very large.
³ http://www.streambench.org/
For example, our first implementation of external memory sorting on this
machine used internal quicksort as a subroutine. For more than two disks
this internal sorting was the main bottleneck. Hence, internal efficiency is a
really important aspect of good external memory algorithms.
Parallelism
In parallel models, internal efficiency and locality is as important as in the
sequential case. In particular, temporal and spatial locality with respect to
the local memory of a processor is an issue.
A new issue is load balancing or declustering. All disks should access
useful data in most parallel I/O steps. All processors should have about the
same amount of work during a superstep in the BSP model, and no processor
should have to send or receive too much data at the end of a superstep.
When programming shared memory machines, the caching policies de-
scribed above must be taken into account. True sharing occurs when several
processors write to the same memory location. Such writes are expensive
since they amount to invalidation and reloading of entire cache blocks by
all other processors reading this location. Hence, parallel algorithms should
avoid frequent write accesses to shared data. Moreover, even write accesses
to different memory cells might lead to the same sharing effect if the cells
are located on the same block of memory. This phenomenon is called false
sharing. Chapter 16 studies true and false sharing in detail using sorting as
an example.
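To make false sharing tangible, here is a small C++ sketch (ours, not taken from Chapter 16): two threads increment logically independent counters. Unpadded, the counters typically share one cache block, so every increment invalidates the other processor's cached copy; padding each counter to its own block — here assuming 64-byte cache blocks — removes the effect.

```cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

// alignas(64) puts each counter in its own cache block (assuming 64-byte
// blocks); in Unpadded both counters usually end up in the same block.
struct Padded {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};
struct Unpadded {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Two threads do completely independent work on c.a and c.b.
template <class Counters>
double run(long iters) {
    Counters c;
    auto start = std::chrono::steady_clock::now();
    std::thread t1([&] { for (long i = 0; i < iters; ++i) c.a++; });
    std::thread t2([&] { for (long i = 0; i < iters; ++i) c.b++; });
    t1.join();
    t2.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start)
        .count();
}

int main() {
    const long iters = 10'000'000;
    std::cout << "padded:   " << run<Padded>(iters) << " s\n";
    std::cout << "unpadded: " << run<Unpadded>(iters) << " s\n";  // typically much slower
}
```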
1.5 Lower Bounds
A large number of external memory algorithms can be assembled from
the three ingredients scanning, sorting, and searching. There are essentially
matching upper and lower bounds for the number of I/Os needed to perform
these operations:
Scanning: Look at the input once in the order it is stored. If $N$ is the
amount of data to be inspected, we obviously need
$$\mathrm{scan}(N) = \Theta(N/B) \text{ I/Os.} \qquad (1.1)$$
Permuting and Sorting: Too often, the data is not arranged in a way that
scanning helps. Then we can rearrange the data into an order where scanning
is useful. When we already know where to place each element, this means
permuting the data. When the permutation is defined implicitly via a total
ordering “<” of the elements, we have to sort with respect to “<”. Chapter 3
gives an upper bound of
$$\mathrm{sort}(N) = \Theta\!\left(\frac{N}{B}\log_{M/B}\frac{N}{B}\right) \text{ I/Os} \qquad (1.2)$$
for sorting. In Section 1.5.1, we will see an almost identical lower bound for
permuting that is also a lower bound for the more difficult problem of sorting.
Searching: Any pointer-based data structure indexing $N$ elements needs
access time
$$\mathrm{search}(N) = \Omega\!\left(\log_B \frac{N}{M}\right) \text{ I/Os.} \qquad (1.3)$$
This lower bound is explained in Section 1.5.2. In Chapter 2 we see a matching
upper bound for the simple case of a linear order. High dimensional problems
such as the geometric data structures explained in Chapter 6 can be more
difficult.
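To get a feeling for these quantities, plug in some illustrative (not from the text) external memory parameters: $M = 2^{30}$ words of internal memory, blocks of $B = 2^{17}$ words, and an input of $N = 2^{40}$ words. Then
$$\log_{M/B}\frac{N}{B} = \frac{\log(N/B)}{\log(M/B)} = \frac{23}{13} < 2,$$
so sorting costs only about twice as many I/Os as scanning; the base-$(M/B)$ logarithm is a very small factor for realistic parameter values.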
Arge and Bro Miltersen [59] give a more detailed account of lower bounds
for external memory algorithms.
1.5.1 Permuting and Sorting
We analyze the following problem: how many I/O operations are necessary
to generate a permutation of the input? A lower bound on permuting implies
a lower bound for sorting because for every permutation of a set of elements,
there is a set of keys that forces sorting to produce this permutation. The
lower bound was established in a seminal paper by Aggarwal and Vitter
[17]. Here we report a simplified proof based on unpublished lecture notes by
Albers, Crauser, and Mehlhorn [24].
To establish a lower bound, we need to specify precisely what a permuta-
tion algorithm can do. We make some restrictions but most of them can be
lifted without changing the lower bound significantly. We view the internal
memory as a bag being able to hold up to M elements. External memory is
an array of elements. Reading and writing external memory is always aligned
to block boundaries, i.e., if the cells of external memory are numbered, ac-
cess is always to cells i, . . . , i + B − 1 such that i is a multiple of B. At the
beginning, the first N/B blocks of external memory contain the input. The
internal memory and the remaining external memory contain no elements. At
the end, the output is again in the first N/B blocks of the external memory.
We view our elements as abstract objects, i.e., the only operation available on
them is to move them around. They cannot be duplicated, split, or modified
in any way. A read step moves B elements from a block of external memory to
internal memory. A write step moves any B elements from internal memory
into a block of external memory. In this model, the following theorem holds:
Theorem 1.1. Permuting N elements takes at least
t ≥ 2 · (N/B) · log(N/eB) / (log(eM/B) + 2 log(N/B)/B) I/Os.
For N = O((eM/B)^{B/2}) the term log(eM/B) dominates the denominator
and we get a lower bound for sorting of
2 · (N/B) · log(N/eB) / O(log(eM/B)) = Ω((N/B) · log_{M/B}(N/B)) I/Os,
which is the same as the upper bound from Chapter 3.
The basic approach for establishing Theorem 1.1 is simple. We find an
upper bound c_t on the number of different permutations that can be generated
after t I/O steps, by looking at all possible sequences of t I/O steps. Since there
are N! possible permutations of N elements, t must be large enough that
c_t ≥ N!, because otherwise there are permutations that cannot be generated
using t I/Os. Solving for t yields the desired lower bound.
A state of the algorithm can be described abstractly as follows:
1. the set of elements in the main memory;
2. the set of elements in each nonempty block of external memory;
3. the permutation in which the elements in each nonempty block of external
memory are stored.
We call two states equivalent if they agree in the first two components (they
may differ in the third).
In the final state, the elements are stored in N/B blocks of B elements
each. Each equivalence class of final states therefore consists of (B!)^{N/B}
states. Hence, it suffices for our lower bound to find out when the number C_t
of equivalence classes of final states reachable after t I/Os exceeds
N!/(B!)^{N/B}.
We estimate C_t inductively. Clearly, C_0 = 1.
Lemma 1.2. C_{t+1} ≤ C_t · (N/B) if the I/O operation is a read, and
C_{t+1} ≤ C_t · (N/B) · (M choose B) if the I/O operation is a write.
Proof. A read specifies which out of the N/B nonempty blocks is to be read. A
write additionally specifies which elements are to be written and in which
permutation. If i elements are written, there are (M choose i) ≤ (M choose B)
choices for the elements to be written, and their permutation is irrelevant
as far as equivalence of states is concerned. The inequality
(M choose i) ≤ (M choose B) assumes that B ≤ M/2.
Lemma 1.3. In any algorithm that produces a permutation in our model,
the number of reads equals the number of writes.
Proof. A read increments the number of empty blocks. A write decrements
the number of empty blocks. At the beginning and at the end there are exactly
N/B nonempty blocks. Hence, the number of increases equals the number of
decreases.
Combining Lemmas 1.2 and 1.3 we see that for even t,
N!/(B!)^{N/B} ≤ C_t ≤ (N/B)^t · (M choose B)^{t/2}. (1.4)
We can simplify this relation using the well-known bounds (m/e)^m ≤
m! ≤ m^m. We get (M choose B) ≤ M^B/B! ≤ (eM/B)^B and N!/(B!)^{N/B} ≥
(N/e)^N / B^N. Relation (1.4) therefore implies
(N/B)^t · (eM/B)^{Bt/2} ≥ (N/(eB))^N,
or, after taking logarithms and solving for t,
t · (2 log(N/B) + B log(eM/B)) ≥ 2N log(N/eB), or
t ≥ 2 · (N/B) · log(N/eB) / (log(eM/B) + 2 log(N/B)/B).
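The final bound is easy to evaluate numerically. As a minimal sketch (Python; the parameters are made up for illustration), one can plug concrete values of N, M, and B into Theorem 1.1:

import math

def permute_lower_bound(N, M, B):
    # t >= 2 * (N/B) * log(N/eB) / (log(eM/B) + 2*log(N/B)/B)
    numerator = math.log(N / (math.e * B))
    denominator = math.log(math.e * M / B) + 2 * math.log(N / B) / B
    return 2 * (N / B) * numerator / denominator

# For large B the term 2*log(N/B)/B is negligible, so the bound behaves
# like 2*(N/B) * log(N/eB) / log(eM/B) = Omega((N/B) * log_{M/B}(N/B)).
print(permute_lower_bound(10**9, 10**6, 10**3))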
1.5.2 Pointer Based Searching
Fig. 1.2. Pointer based searching.
Consider a data structure storing a set of N elements in external memory.
Access to blocks of external memory is pointer based, i.e., we are only allowed
to access a block i if its address is actually stored somewhere. We play a
similar game as for the sorting bound and count the number of different
blocks C_t that can be accessed after t I/O operations. This count has to
exceed N/B to make all elements accessible. Initially, the fast memory could
be full of pointers, so that we have C_0 = M. Each additional block read gives
us a factor of B more possibilities. Hence, C_t = M · B^t. Figure 1.2 illustrates
this situation. Solving C_t ≥ N/B yields t ≥ log_B(N/(MB)). Since we need an
additional access for actually retrieving an element, we get a lower bound of
log_B(N/M) I/Os for being able to reach each of the N elements.
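The counting argument is just a few lines of arithmetic. A minimal sketch (Python; parameters illustrative) finds the smallest t with M · B^t ≥ N/B and compares it to log_B(N/M):

import math

def pointer_search_lower_bound(N, M, B):
    # C_0 = M: internal memory may initially be full of pointers
    t, reachable = 0, M
    while reachable < N / B:   # must be able to reach all N/B blocks
        reachable *= B         # each block read multiplies choices by B
        t += 1
    return t + 1               # plus one I/O to fetch the element itself

N, M, B = 10**12, 10**6, 10**3
print(pointer_search_lower_bound(N, M, B), math.log(N / M, B))  # -> 2 2.0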
2. Basic External Memory Data Structures
Rasmus Pagh∗
This chapter is a tutorial on basic data structures that perform well in mem-
ory hierarchies. These data structures have a large number of applications
and furthermore serve as an introduction to the basic principles of designing
data structures for memory hierarchies.
We will assume the reader to have a background in computer science that
includes a course in basic (internal memory) algorithms and data structures.
In particular, we assume that the reader knows about queues, stacks, and
linked lists, and is familiar with the basics of hashing, balanced search trees,
and priority queues. Knowledge of amortized and expected case analysis will
also be assumed. For readers with no such background we refer to one of the
many textbooks covering basic data structures in internal memory, e.g., [216].
The model we use is a simple one that focuses on just two levels of the
memory hierarchy, assuming the movement of data between these levels to
be the main performance bottleneck. (More precise models and a model that
considers all memory levels at the same time are discussed in Chapter 8 and
Chapter 9.) Specifically, we consider the external memory model described
in Chapter 1.
Our notation is summarized in Fig. 2.1. The parameters M, w, and B describe
the model. The size of the problem instance is denoted by N, where
N ≤ 2^w. The parameter Z is query dependent, and is used to state output-sensitive
I/O bounds. To reduce notational overhead we take logarithms to
always be at least 1, i.e., log_a b should be read as “max(log_a b, 1)”.
N – number of data items
M – number of data items that can be stored in internal memory
B – number of data items that can be stored in an external memory block
Z – number of data items reported in a query
w – word length of processor and size of data items in bits
Fig. 2.1. Summary of notation
We will not go into details of external memory management, but simply
assume that we can allocate a chunk of contiguous external memory of any
size we desire, such that access to any block in the chunk costs one I/O.
(However, as described in Section 2.2.1 we may use a dictionary to simulate
virtual memory using just one large chunk of memory, incurring a constant
factor I/O overhead).
∗ Part of this work was done while the author was at BRICS, University of Aarhus,
where he was partially supported by the Future and Emerging Technologies
programme of the EU under contract number IST-1999-14186 (ALCOM-FT).
2.1 Elementary Data Structures
We start by going through some of the most elementary data structures.
These are used extensively in algorithms and as building blocks when im-
plementing other data structures. This will also highlight some of the main
differences between internal and external memory data structures.
2.1.1 Stacks and Queues
Stacks and queues represent dynamic sets of data elements, and support op-
erations for adding and removing elements. They differ in the way elements
are removed. In a stack , a remove operation deletes and returns the set ele-
ment most recently inserted (last-in-first-out), whereas in a queue it deletes
and returns the set element that was first inserted (first-in-first-out).
Recall that both stacks and queues for sets of size at most N can be
implemented efficiently in internal memory using an array of length N and
a few pointers. Using this implementation on external memory gives a data
structure that, in the worst case, uses one I/O per insert and delete operation.
However, since we can read or write B elements in one I/O, we could hope to
do considerably better. Indeed this is possible, using the well-known technique
of a buffer.
An External Stack. In the case of a stack, the buffer is just an internal
memory array of 2B elements that at any time contains the k set elements
most recently inserted, where k ≤ 2B. Remove operations can now be imple-
mented using no I/Os, except for the case where the buffer has run empty.
In this case a single I/O is used to retrieve the block of B elements most
recently written to external memory.
One way of looking at this is that external memory is used to implement
a stack with blocks as data elements. In other words: The “macroscopic view”
in external memory is the same as the “microscopic view” in internal memory.
This is a phenomenon that occurs quite often – other examples will be the
search trees in Section 2.3 and the hash tables in Section 2.4.
Returning to external stacks, the above means that at least B remove
operations are made for each I/O reading a block. Insertions use no I/Os
except when the buffer runs full. In this case a single I/O is used to write the
B least recent elements to a block in external memory. Summing up, both
insertions and deletions are done in 1/B I/O, in the amortized sense. This is
the best performance we could hope for when storing or retrieving a sequence
of data items much larger than internal memory, since no more than B items
can be read or written in one I/O. A desired goal in many external memory
data structures is that when reporting a sequence of elements, only O(1/B)
I/O is used per element. We return to this in Section 2.3.
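As a minimal sketch of the technique (Python; the list-of-blocks `disk` is an illustrative stand-in for external memory), the buffer holds up to 2B elements, a full buffer writes out the B least recently inserted elements, and an empty buffer reads back the most recently written block:

class ExternalStack:
    # Sketch: amortized O(1/B) I/Os per push/pop.
    def __init__(self, B):
        self.B = B
        self.buffer = []   # internal memory, at most 2B elements
        self.disk = []     # external memory: a stack of B-element blocks

    def push(self, x):
        if len(self.buffer) == 2 * self.B:
            # one I/O: write the B least recently inserted elements
            self.disk.append(self.buffer[:self.B])
            self.buffer = self.buffer[self.B:]
        self.buffer.append(x)

    def pop(self):
        if not self.buffer and self.disk:
            # one I/O: read the block most recently written to disk
            self.buffer = self.disk.pop()
        return self.buffer.pop()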
Exercise 2.1. Why does the stack not use a buffer of size B?
An External Queue. To implement an efficient queue we use two buffers
of size B, a read buffer and a write buffer. Remove operations work on the
read buffer until it is empty, in which case the least recently written external
memory block is read into the buffer. (If there are no elements in external
memory, the contents of the write buffer is transfered to the read buffer.)
Insertions are done to the read buffer which when full is written to external
memory. Similar to before, we get at most 1/B I/O per operation.
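A matching sketch for the queue (Python, same illustrative conventions as the stack above):

from collections import deque

class ExternalQueue:
    # Sketch: amortized O(1/B) I/Os per insert/remove.
    def __init__(self, B):
        self.B = B
        self.read_buf = deque()   # serves remove operations
        self.write_buf = deque()  # collects insertions
        self.disk = deque()       # external blocks, least recently written first

    def insert(self, x):
        self.write_buf.append(x)
        if len(self.write_buf) == self.B:
            # one I/O: write the full buffer as a block
            self.disk.append(list(self.write_buf))
            self.write_buf.clear()

    def remove(self):
        if not self.read_buf:
            if self.disk:
                # one I/O: read the least recently written block
                self.read_buf.extend(self.disk.popleft())
            else:
                # nothing on disk: transfer the write buffer
                self.read_buf.extend(self.write_buf)
                self.write_buf.clear()
        return self.read_buf.popleft()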
Problem 2.2. Above we saw how to implement stacks and queues having
a fixed bound on the maximum number of elements. Show how to efficiently
implement external stacks and queues with no bound on the number of ele-
ments.
2.1.2 Linked Lists
Linked lists provide an efficient implementation of ordered lists of elements,
supporting sequential search, deletions, and insertion in arbitrary locations
of the list. Again, a direct implementation of the internal memory data struc-
ture could behave poorly in external memory: When traversing the list, the
algorithm may need to perform one I/O every time a pointer is followed. (The
task of traversing an entire linked list on external memory can be performed
more efficiently. It is essentially list ranking, described in Chapter 3.)
Again, the solution is to maintain locality, i.e., elements that are near each
other in the list must tend to be stored in the same block. An immediate idea
would be to put chunks of B consecutive elements together in each block and
link these blocks together. This would certainly mean that a list of length N
could be traversed in N/B I/Os. However, this invariant is hard to maintain
when inserting and deleting elements.
Exercise 2.3. Argue that certain insertions and deletions will require N/B
I/Os if we insist on exactly B consecutive elements in every block (except
possibly the last).
To allow for efficient updates, we relax the invariant to require that, e.g.,
there are more than (2/3)B elements in every pair of consecutive blocks. This
increases the number of I/Os needed for a sequential scan by at most a factor
of three. Insertions can be done in a single I/O except for the case where the
block supposed to hold the new element is full. If either neighbor of the
block has spare capacity, we may push an element to this block. In case both
neighbors are full, we split the block into two blocks of about B/2 elements
each. Clearly this maintains the invariant (in fact, at least B/6 deletions
will be needed before the invariant is violated in this place again). When
deleting an element we check whether the total number of elements in the
block and one of its neighbors is (2/3)B or less. If this is the case we merge the
two blocks. It is not hard to see that this reestablishes the invariant: Each
of the two pairs involving the new block now has more elements than the
corresponding pairs had before.
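A schematic sketch of the insertion case (Python; blocks are plain lists, only the right neighbor is tried, and the I/Os are indicated in comments rather than performed):

def blocked_list_insert(blocks, i, pos, x, B):
    # Insert x at offset pos of block i in a blocked linked list (sketch).
    block = blocks[i]
    block.insert(pos, x)
    if len(block) <= B:
        return  # one I/O: write block i back
    if i + 1 < len(blocks) and len(blocks[i + 1]) < B:
        # right neighbor has spare capacity: push our last element to it
        blocks[i + 1].insert(0, block.pop())  # two I/Os
    else:
        # neighbors full: split into two blocks of about B/2 elements each
        half = len(block) // 2
        blocks[i:i + 1] = [block[:half], block[half:]]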
To sum up, a constant number of I/Os suffice to update a linked list. In
general this is the best we can hope for when updates may affect any part
of the data structure, and we want queries in an (eager) on-line fashion. In
the data structures of Section 2.1.1, updates concerned very local parts of
the data structure (the top of the stack and the ends of the queue), and we
were able to do better. Section 2.3.5 will show that a similar improvement is
possible in some cases where we can afford to wait for the answer to a query
to arrive.
Exercise 2.4. Show that insertion of N consecutive elements in a linked list
can be done in O(1 + N/B) I/Os.
Exercise 2.5. Show how to implement concatenation of two lists and split-
ting of a list into two parts in O(1) I/Os.
Problem 2.6. Show how to increase space utilization from 1/3 to 1 − ε,
where ε > 0 is a constant, with no asymptotic increase in update time. (Hint:
Maintain an invariant on the number of elements in any Θ(1/ε) consecutive
blocks.)
Pointers. In internal memory one often has pointers to elements of linked
lists. Since memory for each element is allocated separately, a fixed pointer
suffices to identify the element in the list. In external memory elements may
be moved around to ensure locality after updates in other parts of the list, so
a fixed pointer will not suffice. One solution is to maintain a list of pointers to
the pointers, which allows them to be updated whenever we move an element.
If the number of pointers to each element is constant, the task of maintaining
the pointers does not increase the amortized cost of updates by more than
a constant factor, and the space utilization drops only by a constant factor,
assuming that each update costs Ω(1) I/Os. (This need not be the case, as we
saw in Exercise 2.4.) A solution that allows an arbitrary number of pointers to
each list element is to use a dictionary to maintain the pointers, as described
in Section 2.2.1.
2.2 Dictionaries
A dictionary is an abstract data structure that supports lookup queries: Given
a key k from some finite set K, return any information in the dictionary
associated with k. For example, if we take K to be the set of social security
numbers, a dictionary might associate with each valid social security number
the tax information of its owner. A dictionary may support dynamic updates
in the form of insertions and deletions of keys (with associated information).
Recall that N denotes the number of keys in the dictionary, and that B keys
(with associated information) can reside in each block of external memory.
There are two basic approaches to implementing dictionaries: Search trees
and hashing. Search trees assume that there is some total ordering on the
key set. They offer the highest flexibility towards extending the dictionary to
support more types of queries. We consider search trees in Section 2.3. Hash-
ing based dictionaries, described in Section 2.4, support the basic dictionary
operations in an expected constant number of I/Os (usually one or two). Be-
fore describing these two approaches in detail, we give some applications of
external memory dictionaries.
2.2.1 Applications of Dictionaries
Dictionaries can be used for simple database retrieval as in the example above.
Furthermore, they are useful components of other external memory data
structures. Two such applications are implementations of virtual memory
and robust pointers.
Virtual Memory. External memory algorithms often do allocation and
deallocation of arrays of blocks in external memory. As in internal mem-
ory this can result in problems with fragmentation and poor utilization of
external memory. For almost any given data structure it can be argued that
fragmentation can be avoided, but this is often a cumbersome task.
A general solution that gives a constant factor increase in the number
of I/Os performed is to implement virtual memory using a dictionary. The
key space is K = {1, . . . , C} × {1, . . . , L}, where C is an upper bound on the
number of arrays we will ever use and L is an upper bound on the length of
any array. We wish the ith block of array c to be returned from the dictionary
when looking up the key (c, i). In case the block has never been written to, the
key will not be present, and some standard block content may be returned.
Allocation of an array consists of choosing c ∈ {1, . . . , C} not used for any
other array (using a counter, say), and associating a linked list of length 0
with the key (c, 0). When writing to block i of array c in virtual memory, we
associate the block with the key (c, i) in the dictionary and add the number i
to the linked list of key (c, 0). For deallocation of the array we simply traverse
the linked list of (c, 0) to remove all keys associated with that array.
In case the dictionary uses O(1) I/Os per operation (amortized expected)
the overhead of virtual memory accesses is expected to be a constant factor.
Note that the cost of allocation is constant and that the amortized cost of
deallocation is constant. If the dictionary uses linear space, the amount of
external memory used is bounded by a constant times the amount of virtual
memory in use.
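A minimal sketch of the simulation (Python; a plain dict stands in for the external memory dictionary, and blocks of an array are numbered from 1 so that the key (c, 0) is free for bookkeeping):

class VirtualMemory:
    # Sketch: virtual arrays of blocks on top of a dictionary.
    def __init__(self):
        self.d = {}        # stands in for the external dictionary
        self.next_c = 1    # counter for choosing fresh array names

    def allocate(self):
        c = self.next_c
        self.next_c += 1
        self.d[(c, 0)] = []          # list of block indices written so far
        return c

    def write(self, c, i, block):    # i >= 1
        if (c, i) not in self.d:
            self.d[(c, 0)].append(i) # remember i for deallocation
        self.d[(c, i)] = block

    def read(self, c, i):
        # unwritten blocks return some standard content (here: empty)
        return self.d.get((c, i), [])

    def deallocate(self, c):
        for i in self.d.pop((c, 0)):
            del self.d[(c, i)]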
Robust Pointers into Data Structures. Pointers into external memory
data structures pose some problems, as we saw in Section 2.1.2. It is often
possible to deal with such problems in specific cases (e.g., level-balanced B-
trees described in Section 2.3.4), but as we will see now there is a general
solution that, at the cost of a constant factor overhead, enables pointers
to be maintained at the cost of O(1) I/O (expected) each time an element
is moved. The solution is to use a hashing based dictionary with constant
lookup time and expected constant update time to map “pointers” to disk
blocks. In this context a pointer is any kind of unique identifier for a data
element. Whenever an element is moved we simply update the information
associated with its pointer accordingly. Assuming that pointers are succinct
(not much larger than ordinary pointers) the space used for implementing
robust pointers increases total space usage by at most a constant factor.
2.3 B-trees
This section considers search trees in external memory. Like the hashing based
dictionaries covered in Section 2.4, search trees store a set of keys along with
associated information. Though not as efficient as hashing schemes for lookup
of keys, we will see that search trees, as in internal memory, can be used as
the basis for a wide range of efficient queries on sets (see, e.g., Chapter 6 and
Chapter 7). We use N to denote the size of the key set, and B to denote the
number of keys or pointers that fit in one block.
B-trees are a generalization of balanced binary search trees to balanced
trees of degree Θ(B) [96, 207, 416, 460]. The intuitive reason why we should
change to search trees of large degree in external memory is that we would
like to use all the information we get when reading a block to guide the search.
In a naïve implementation of binary search trees there would be no guarantee
that the nodes on a search path did not reside in distinct blocks, incurring
O(log N ) I/Os for a search. As we shall see, it is possible to do significantly
better. In this section it is assumed that B/8 is an integer greater than or
equal to 4.
The following is a modification of the original description of B-trees, with
the essential properties preserved or strengthened. In a B-tree all leaves have
the same distance to the root (the height h of the tree). The level of a B-tree
node is the distance to its descendant leaves. Rather than having a single key
in each internal node to guide searches to one of two subtrees, a B-tree node
guides searches to one of Θ(B) subtrees. In particular, the number of leaves
below a node (called its weight) decreases by a factor of Θ(B) when going
one level down the tree. We use a weight balance invariant, first described
for B-trees by Arge and Vitter [71]: Every node at level i < h has weight at
least (B/8)^i, and every node at level i ≤ h has weight at most 4(B/8)^i. As
shown in the following exercise, the weight balance invariant implies that the
degree of any non-root node is Θ(B) (this was the invariant in the original
description of B-trees [96]).
Exercise 2.7. Show that the weight balance invariant implies the following:
1. Any node has at most B/2 children.
2. The height of the B-tree is at most 1 + log_{B/8} N.
3. Any non-root node has at least B/32 children.
Note that B/2 pointers to subtrees, B/2 − 1 keys and a counter of the
number of keys in the subtree all fit in one external memory block of size B.
All keys and their associated information are stored in the leaves of the tree,
represented by a linked list containing the sorted key sequence. Note that
there may be fewer than Θ(B) elements in each block of the linked list if the
associated information takes up more space than the keys.
2.3.1 Searching a B-tree
In a binary search tree the key in a node splits the key set into those keys
that are larger or equal and those that are smaller, and these two sets are
stored separately in the subtrees of the node. In B-trees this is generalized
as follows: In a node v storing keys k_1, . . . , k_{d_v −1}, the i-th subtree stores keys
k with k_{i−1} ≤ k < k_i (defining k_0 = −∞ and k_{d_v} = ∞). This means that
the information in a node suffices to determine in which subtree to continue
a search.
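In code, the routing step is a predecessor search among the node's keys. A minimal sketch (Python; the dict-based node layout with fields keys and children is hypothetical):

import bisect

def btree_search(node, k):
    # Descend to the leaf that may contain k; one I/O per level (sketch).
    while node["children"]:
        # subtree i stores keys k with k_{i-1} <= k < k_i
        i = bisect.bisect_right(node["keys"], k)
        node = node["children"][i]
    return node  # leaf block; scan it for k and its associated information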
The worst-case number of I/Os needed for searching a B-tree equals
the worst-case height of a B-tree, found in Exercise 2.7 to be at most
1 + log_{B/8} N. Compared to an external binary search tree, we save roughly
a factor log B on the number of I/Os.
Example 2.8. If external memory is a disk, the number 1 + log_{B/8} N is quite
small for realistic values of N and B. For example, if B = 2^12 and N ≤ 2^27
the depth of the tree is bounded by 4. Of course, the root could be stored in
internal memory, meaning that a search would require three I/Os.
Exercise 2.9. Show a lower bound of Ω(log_B N) on the height of a B-tree.
Problem 2.10. Consider the situation where we have no associated infor-
mation, i.e., we wish to store only the keys. Show that the maximum height
of a B-tree can be reduced to 1 + log_{B/8}(2N/B) by abandoning the linked
list and grouping adjacent leaves together in blocks of at least B/2. What
consequence does this improvement have in the above example?
Range Reporting. An indication of the strength of tree structures is that
B-trees immediately can be seen to support range queries, i.e., queries of the
form “report all keys in the range [a; b]” (we consider the case where there is
no associated information). This can be done by first searching for the key
a, which will lead to the smallest key x ≥ a. We then traverse the linked list
starting with x and report all keys smaller than b (whenever we encounter
a block with a key larger than b the search is over). The number of I/Os
used for reporting Z keys from the linked list is O(Z/B), where Z/B is the
minimum number of I/Os we could hope for. The feature that the number
of I/Os used for a query depends on the size of the result is called output
sensitivity. To sum up, Z elements in a given range can be reported by a B-
tree in O(log_B N + Z/B) I/Os. Many other reporting problems can be solved
within this bound.
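Combining the search with the linked leaf list gives the range query. A sketch (Python), continuing the hypothetical node layout above, with leaves linked through a next field:

def range_report(root, a, b):
    # Report all keys in [a, b] using O(log_B N + Z/B) I/Os (sketch).
    leaf = btree_search(root, a)      # O(log_B N) I/Os
    result = []
    while leaf is not None:
        for key in leaf["keys"]:      # one I/O per visited leaf block
            if key > b:
                return result         # keys are sorted: the search is over
            if key >= a:
                result.append(key)
        leaf = leaf["next"]
    return result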
It should be noted that there exists an optimal size (static) data struc-
ture based on hashing that performs range queries in O(1 + Z/B) I/Os [35].
However, a slight change in the query to “report the first Z keys in the range
[a; b]” makes the approach used for this result fail to have optimal output
sensitivity (in fact, this query provably has a time complexity that grows
with N [98]). Tree structures, on the other hand, tend to easily adapt to such
changes.
2.3.2 Inserting and Deleting Keys in a B-tree
Insertions and deletions are performed as in binary search trees except for
the case where the weight balance invariant would be violated by doing so.
Inserting. When inserting a key x we search for x in the tree to find the
internal node that should be the parent of the leaf node for x. If the weight
constraint is not violated on the search path for x we can immediately insert
x, and a pointer to the leaf containing x and its associated information. If
the weight constraint is violated in one or more nodes, we rebalance it by
performing split operations in overweight nodes, starting from the bottom
and going up. To split a node v at level i > 0, we divide its children into
two consecutive groups, each of weight between 2(B/8)i − 2(B/8)i−1 and
2(B/8)i + 2(B/8)i−1 . This is possible as the maximum weight of each child is
4(B/8)i−1 . Node v is replaced by two nodes having these groups as children
(this requires an update of the parent node, or the creation of a new root if v
is the root). Since B/8 ≥ 4 the weight of each of these new nodes is between
3 i 5 i i
2 (B/8) and 2 (B/8) , which is Ω((B/8) ) away from the limits.
Deleting. Deletions can be handled in a manner symmetric to insertions.
Whenever deleting a leaf would violate the lower bound on the weight of a
node v, we perform a rebalancing operation on v and a sibling w. If several
nodes become underweight we start the rebalancing at the bottom and move
up the tree.
Suppose v is an underweight node at level i, and that w is (one of) its
nearest sibling(s). In case the combined weight of v and w is less than (7/2)(B/8)^i
we fuse them into one node having all the children of v and w as children. In
case v and w were the only children of the root, this node becomes the new
root. The other case to consider is when the combined weight is more than
(7/2)(B/8)^i, but at most 5(B/8)^i (since v is underweight). In this case we make
w share some children with v by dividing all the children into two consecutive
groups, each of weight between (7/4)(B/8)^i − 2(B/8)^{i−1} and (5/2)(B/8)^i + 2(B/8)^{i−1}.
These groups are then made the children of v and w, respectively. In both
cases, the weight of all changed nodes is Ω((B/8)^i) away from the limits.
An alternative to doing deletions in this way is to perform periodical global
rebuilding, a technique described in Section 2.5.2.
Analysis. The cost of rebalancing a node is O(1) I/Os, as it involves a con-
stant number of B-tree nodes. This shows that B-tree insertions and deletions
can be done in O(log_B N) I/Os.
However, we have in fact shown something stronger. Suppose that whenever
a level-i node v of weight W = Θ((B/8)^i) is rebalanced we spend f(W)
I/Os to compute an auxiliary data structure used when searching in the subtree
with root v. The above weight balancing arguments show that Ω(W)
insertions and deletions in v’s subtree are needed for each rebalancing operation.
Thus, the amortized cost of maintaining the auxiliary data structures
is O(f(W)/W) I/Os per node on the search path of an update, or
O((f(W)/W) · log_B N) I/Os per update in total. As an example, if the auxiliary
data structure can be constructed by scanning the entire subtree in O(W/B)
I/Os, the amortized cost per update is O((1/B) · log_B N) I/Os, which is negligible.
Problem 2.11. Modify the rebalancing scheme to support the following
type of weight balance condition: A B-tree node at level i < h is the root
of a subtree having Θ((B/(2 + ε))^i) leaves, where ε > 0 is a constant. What
consequence does this have for the height of the B-tree?
2.3.3 On the Optimality of B-trees
As seen in Chapter 1 the bound of O(log_B N) I/Os for searching is the best
we can hope for if we consider algorithms that use only comparisons of keys
to guide searches. If we have a large amount of internal memory and are
willing to use it to store the top M/B nodes of the B-tree, the number of
I/Os for searches and updates drops to O(log_B(N/M)).
Exercise 2.12. How large should internal memory be to make O(log_B(N/M))
asymptotically smaller than O(log_B N)?
There are non-comparison-based data structures that break the above
bound. For example, the predecessor dictionary mentioned in Section 2.4.2
uses linear space and time O(log w) to search for a key, where w denotes the
number of bits in a key (below we call the amount of storage for a key a
word). This is faster than a B-tree if N is much larger than B and w. Note
that a predecessor dictionary also supports the range queries discussed in
Section 2.3.1. There are also predecessor data structures whose search time
improves with the number of bits that can be read in one step (in our case
Bw bits). When translated to external memory, these results (see [48, 98, 373]
and the references therein) can be summarized as follows:
Theorem 2.13. There is an external memory data structure for N keys of w
bits that supports deterministic predecessor queries, insertions and deletions
in the following worst-case number of I/Os:
O( min( log(N/BM) / log log(N/BM), log_{Bw}(N/M), (log w / log log w) · log log_{Bw}(N/M) ) )
where internal space usage is O(M ) words and external space usage is
O(N/B) blocks of B words.
Using randomization it is also possible to perform all operations in ex-
pected O(log w) time, O(B) words of internal space, and O(N/B) blocks of
external space [765]. If main memory size is close to the block size, the upper
bounds on predecessor queries are close to optimal, as shown by Beame and
Fich [98] in the following general lower bounds:
Theorem 2.14. Suppose there is a (static) dictionary for w bit keys us-
ing N^{O(1)} blocks of memory that supports predecessor queries in t I/Os,
worst-case, using O(B) words of internal memory. Then the following bounds
hold:
1. t = Ω(min(log w / log log w, log_{Bw} N)).
2. If w is a suitable function of N then t = Ω(min(log_B N, log N/ log log N)),
i.e., no better bound independent of w can be achieved.
Exercise 2.15. For what parameters are the upper bounds of Theorem 2.13
within a constant factor of the lower bounds of Theorem 2.14?
Though the above results show that it is possible to improve slightly
asymptotically upon the comparison-based upper bounds, the possible sav-
ings are so small (log_B N tends to be a small number already) that it has been
common to stick to the comparison-based model. Another reason is that much
of the development of external memory algorithms has been driven by com-
putational geometry applications. Geometric problems are usually studied in
a model where numbers have infinite precision and can only be manipulated
using arithmetic operations and comparisons.
2.3.4 B-tree Variants
There are many variants of B-trees that add or enhance properties of basic
B-trees. The weight balance invariant we considered above was introduced in
the context of B-trees only recently, making it possible to associate expensive
auxiliary data structures with B-tree nodes at small amortized cost. Below we
summarize the properties of some other useful B-tree variants and extensions.
Parent Pointers and Level Links. It is simple to extend basic B-trees
to maintain a pointer to the parent of each node at no additional cost. A
similarly simple extension is to maintain that all nodes at each level are
connected in a doubly linked list. One application of these pointers is a finger
search: Given a leaf v in the B-tree, search for another leaf w. We go up
the tree from v until the current node or one of its level-linked neighbors
has w below it, and then search down the tree for w. The number of I/Os is
O(log_B Q), where Q is the number of leaves between v and w. When searching
for nearby leaves this is a significant improvement over searching for w from
the root.
Divide and Merge Operations. In some applications it is useful to be
able to divide a B-tree into two parts, with keys smaller than and larger than
some splitting element, respectively. Conversely, if we have two B-trees where
all keys in one is smaller than all keys in the other, we may wish to efficiently
“glue” these trees together into one. In normal B-trees these operations can
be supported in O(log_B N) I/Os. However, it is not easy to simultaneously
maintain parent pointers. Level-balanced B-trees [4] maintain parent pointers
and support divide, merge, and the usual B-tree operations in O(log_B N) I/Os.
If there is no guarantee that keys in one tree are smaller than keys in the
other, merging is much harder, as shown in the following problem.
Problem 2.16. Show that, in the lower bound model of Aggarwal and Vit-
ter [17], merging two B-trees with Θ(N ) keys requires Θ(N/B) I/Os in the
worst case.
Partially Persistent B-trees. In partially persistent B-trees (sometimes
called multiversion B-trees) each update conceptually results in a new ver-
sion of the tree. Queries can be made in any version of the tree, which is useful
when the history of the data structure needs to be stored and queried. Per-
sistence is also useful in many geometric algorithms based on the sweepline
paradigm (see Chapter 6).
Partially persistent B-trees can be implemented as efficiently as one could
hope for, using standard internal memory persistence techniques [258, 661].
A sequence of N updates results in a data structure using O(N/B) external
memory blocks, where any version of the tree can be queried in O(log_B N)
I/Os.
I/Os. Range queries, etc., are also supported. For details we refer to [99, 661,
749].
String B-trees. We have assumed that the keys stored in a B-tree have fixed
length. In some applications this is not the case. Most notably, in String B-
trees [296] the keys are strings of unbounded length. It turns out that all the
usual B-tree operations, as well as a number of operations specific to strings,
can be efficiently supported in this setting. String B-trees are presented in
Chapter 7.
2.3.5 Batched Dynamic Problems and Buffer Trees
B-trees answer queries in an on-line fashion, i.e., the answer to a query is
provided immediately after the query is issued. In some applications we can
afford to wait for an answer to a query. For example, in batched dynamic
problems a “batch” of updates and queries is provided to the data structure,
and only at the end of the batch is the data structure expected to deliver the
answers that would have been returned immediately by the corresponding
on-line data structure.
There are many examples of batched dynamic problems in, e.g., compu-
tational geometry. As an example, consider the batched range searching prob-
lem: Given a sequence of insertions and deletions of integers, interleaved with
queries for integer intervals, report for each interval the integers contained in
it. A data structure for this problem can, using the sweepline technique, be
used to solve the orthogonal line segment intersection problem: Given a set
of horizontal and vertical lines in the plane, report all intersections. We refer
to Chapter 6 for details.
Buffer Trees. The buffer tree technique [52] has been used for I/O optimal
algorithms for a number of problems. In this section we illustrate the basic
technique by demonstrating how a buffer tree can be used to handle batched
dictionary operations. For simplicity we will assume that the information
associated with keys has the same size as the keys.
A buffer tree is similar to a B-tree, but has degree Θ(M/B). Its name
refers to the fact that each internal node has an associated buffer which
is a queue that contains a sequence of up to M updates and queries to be
performed in the subtree where the node is root. New updates and queries are
not performed right away, but “lazily” written to the root buffer in O(1/B)
I/Os per operation, as described in Section 2.1.1. Non-root buffers reside
entirely on external memory, and writing K elements to them requires O(1 +
K/B) I/Os.
Whenever a buffer gets full, it is flushed: Its content is loaded into internal
memory, where the updates and queries are sorted according to the subtree
where they have to be performed. These operations are then written to the
buffers of the Θ(M/B) children, in the order they were originally carried out.
This may result in buffers of children flushing, and so forth. Leaves contain
Θ(B) keys. When the buffer of a node v just above the leaves is flushed,
the updates and queries are performed directly on its M/B children, whose
elements fit in main memory. This results in a sorted list of blocks of elements
that form the new children of v. If there are too few or too many children,
rebalancing operations are performed, similar to the ones described for B-
trees (see [52] for details). Each node involved in a rebalancing operation has
its buffer flushed before the rebalancing is done. In this way, the content of
the buffers need not be considered when splitting, fusing, and sharing.
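A schematic sketch of the flushing step (Python; the node fields and the routing helper route are hypothetical, and the leaf-level case with its rebalancing is omitted):

def flush_buffer(node, M):
    # Distribute a full buffer to the children's buffers (sketch).
    ops = node.buffer                  # O(M/B) I/Os to load the buffer
    node.buffer = []
    for op in ops:                     # op = (key, operation), original order
        child = route(node, op[0])     # routing as in a B-tree node
        child.buffer.append(op)        # appending K ops costs O(1 + K/B) I/Os
    for child in node.children:
        if len(child.buffer) >= M:
            flush_buffer(child, M)     # cascading flushes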
The cost of flushing a buffer is O(M/B) I/Os for reading the buffer, and
O(M/B) I/Os for writing the operations to the buffers of the children. Note
that there is a cost of a constant number of I/Os for each child – this is the
reason for making the number of children equal to the I/O-cost of reading
the buffer. Thus, flushing costs O(1/B) I/Os per operation in the buffer, and
since the depth of the tree is O(log_{M/B}(N/B)), the total cost of all flushes is
O((1/B) log_{M/B}(N/B)) I/Os per operation.
The cost of performing a rebalancing operation on a node is O(M/B)
I/Os, as we may need to flush the buffer of one of its siblings. However, the
number of rebalancing operations during N updates is O(N/M ) (see [416]),
so the total cost of rebalancing is O(N/B) I/Os.
Problem 2.17. What is the I/O complexity of operations in a “buffer tree”
of degree Q?
2.3.6 Priority Queues
The priority queue is an abstract data structure of fundamental importance,
primarily due to its use in graph algorithms (see Chapter 4 and Chapter 5). A
priority queue stores an ordered set of keys, along with associated information
(assumed in this section to be of the same size as keys). The basic operations
are: insertion of a key, finding the smallest key, and deleting the smallest
key. (Since only the smallest key can be inspected, the key can be thought of
as a priority, with small keys being “more important”.) Sometimes additional
operations are supported, such as deleting an arbitrary key and decreasing
the value of a key. The motivation for the decrease-key operation is that it
can sometimes be implemented more efficiently than by deleting the old key
and inserting the new one.
There are several ways of implementing efficient external memory pri-
ority queues. Like for queues and stacks (which are both special cases of
priority queues), the technique of buffering is the key. We show how to use
the buffer tree data structure described in Section 2.3.5 to implement a priority
queue using O(M) internal memory, supporting insertion, deletion, and
delete-minimum in O((1/B) log_{M/B}(N/B)) I/Os, amortized, while keeping
the minimum element in internal memory.
The entire buffer of the root node is always kept in internal memory.
Also present in memory are the O(M/B) leftmost leaves, more precisely the
leaves of the leftmost internal node. The invariant is kept that all buffers on
the path from the root to the leftmost leaf are empty. This is done in the
obvious fashion: Whenever the root is flushed we also flush all buffers down
the leftmost path, at a total cost of O((M/B) log_{M/B}(N/B)) I/Os. Since
there are Ω(M) operations between two consecutive flushes of the root buffer,
the amortized cost of these extra flushes is O((1/B) log_{M/B}(N/B)) I/Os per
operation. The analysis is completed by the following exercise.
Exercise 2.18. Show that the current minimum can be maintained inter-
nally using only the root buffer and the set of O(M ) elements in the leftmost
leaves. Conclude that find-minimum queries can be answered on-line without
using any I/Os.
Optimality. It is not hard to see that the above complexities are, in a certain
sense, the best possible.
Exercise 2.19. Show that it is impossible to perform insertions and delete-minimums
in o((1/B) log_{M/B}(N/B)) I/Os per operation, amortized. (Hint:
Reduce from sorting, and use the sorting lower bound – more information on
this reduction technique can be found in Chapter 6.)
In internal memory it is in fact possible to improve the complexity of inser-
tion to constant time, while preserving O(log N ) time for delete-minimum
(see [216, Chapter 20] and [154]). It appears to be an open problem whether
it is possible to implement constant time insertions in external memory.
One way of improving the performance of the priority queue described above
is to provide “worst case” rather than amortized I/O bounds. Of course, it is
not possible for every operation to have a cost of less than one I/O. The best
one can hope for is that any subsequence of k operations uses
O(1 + (k/B) log_{M/B}(N/B))
I/Os. Brodal and Katajainen [157] have achieved this for subsequences of
length k ≥ B. Their data structure does not support deletions.
A main open problem in external memory priority queues is the com-
plexity of the decrease-key operation (when the other operations have com-
plexity as above). Internally, this operation can be supported in constant
time (see [216, Chapter 20] and [154]), and the open problem is whether a
corresponding bound of O(1/B) I/Os per decrease-key can be achieved. The
currently best complexity is achieved by “tournament trees”, described in
Chapter 4, where decrease-key operations, as well as the other priority queue
operations, cost O((1/B) log(N/B)) I/Os.
2.4 Hashing Based Dictionaries
We now consider hashing techniques, which offer the highest performance for
the basic dictionary operations. One aspect that we will not discuss here is
how to implement appropriate classes of hash functions. We will simply
assume to have access to hash functions that behave like truly random func-
tions, independent of the sequence of dictionary operations. This means that
any hash function value h(x) is uniformly random and independent of hash
function values on elements other than x. In practice, using easily imple-
mentable “pseudorandom” hash functions that try to imitate truly random
functions, the behavior of hashing algorithms is quite close to that of this
idealized model. We refer the reader to [251] and the references therein for
more information on practical hash functions.
2.4.1 Lookup with Good Expected Performance
Several classic hashing schemes (see [460, Section 6.4] for a survey) perform
well in the expected sense in external memory. We will consider linear probing
and chaining with separate lists. These schemes need nothing but a single
hash function h in internal memory (in practice a few machine words suffice
for a good pseudorandom hash function). For both schemes the analysis is
beyond the scope of this chapter, but we provide some intuition and state
results on their performance.
Linear Probing. In external memory linear probing, a search for the key x
starts at block h(x) in a hash table, and proceeds linearly through the table
until either x is found or we encounter a block that is not full (indicating
that x is not present in the table). Insertions proceed in the same manner as
lookups, except that we insert x if we encounter a non-full block. Deletion
of a key x requires some rearrangement of the keys in the blocks scanned
when looking up x, see [460, Section 6.4] for details. A deletion leaves the
table in the state it would have been in if the deleted element had never been
inserted.
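A block-granularity sketch of lookup and insertion (Python; the table is a list of blocks, and deletion with its rearrangement is omitted):

def lp_lookup(table, h, x, B):
    # Linear probing over blocks: True iff x is stored (sketch).
    i = h(x)
    while True:
        block = table[i % len(table)]  # one I/O per probed block
        if x in block:
            return True
        if len(block) < B:             # non-full block: x cannot be further on
            return False
        i += 1

def lp_insert(table, h, x, B):
    i = h(x)
    while len(table[i % len(table)]) == B:  # skip full blocks
        i += 1
    table[i % len(table)].append(x)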
The intuitive reason that linear probing gives good average behavior is
that the pseudorandom function distributes the keys almost evenly to the
blocks. In the rare event that a block overflows, it will be unlikely that the
next block is not able to accommodate the overflow elements. More precisely,
if the load factor of our hash table is α, where 0 < α < 1 (i.e., the size of
the hash table is N/(αB) blocks), we have that the expected average number
of I/Os for a lookup is 1 + (1 − α)^{−2} · 2^{−Ω(B)} [460]. If α is bounded away
from 1 (i.e., α ≤ 1 − ε for some constant ε > 0) and if B is not too small,
the expected average is very close to 1. In fact, the asymptotic probability of
having to use k > 1 I/Os for a lookup is 2^{−Ω(B(k−1))}. In Section 2.4.4 we will
consider the problem of keeping the load factor in a certain range, shrinking
and expanding the hash table according to the size of the set.
Chaining with Separate Lists. In chaining with separate lists we again
hash to a table of size approximately N/(αB) to achieve load factor α. Each
block in the hash table is the start of a linked list of keys hashing to that
block. Insertion, deletion, and lookups proceed in the obvious manner. As the
pseudorandom function distributes keys approximately evenly to the blocks,
almost all lists will consist of just a single block. In fact, the probability
that more than kB keys hash to a certain block, for k ≥ 1, is at most
e^{−αB(k/α−1)^2/3} by Chernoff bounds (see, e.g., [375, Eq. 6]).
As can be seen, the probabilities decrease faster with k than in linear
probing. On the other hand, chaining may be slightly more complicated to
implement as one has to manage an expected 2^{−Ω(B)}N blocks in chained lists.
Of course, if B is large and the load factor is not too high, overflows will be
very rare. This can be exploited, as discussed in the next section.
2.4.2 Lookup Using One External Memory Access
In the previous section we looked at hashing schemes with good expected
lookup behavior. Of course, an expected bound may not be good enough for
some applications where a firm guarantee on throughput is needed. In this
and the following section we investigate how added resources may provide
dictionaries in which lookups take just the time of a single I/O in the worst
case. In particular, we consider dictionaries using more internal memory, and
dictionaries using external memory that allows two I/Os to be performed in
parallel.
Making Use of Internal Memory. An important design principle in ex-
ternal memory algorithms is to make full use of internal memory for data
structures that reduce the number of external memory accesses. Typically
such an internal data structure holds part of the external data structure that
will be needed in the future (e.g., the buffers used in Section 2.1), or it holds
information that allows the proper data to be found efficiently in external
memory.
If sufficient internal memory is available, searching in a dictionary can be
done in a single I/O. There are at least two approaches to achieving this.
Overflow area. When internal memory for 2^{−Ω(B)}N keys and associated in-
formation is available internally, there is a very simple strategy that provides
lookups in a single I/O, for constant load factor α < 1. The idea is to store
the keys that cannot be accommodated externally (because of block over-
flows) in an internal memory dictionary. For some constant c(α) = Ω(1 − α)
the probability that there are more than 2^{−c(α)B}N such keys is so small (by
the Chernoff bound) that we can afford to rehash, i.e., choose a new hash
function to replace h, if this should happen.
Alternatively, the overflow area can reside in external memory (this idea
appeared in other forms in [341, 623]). To guarantee single I/O lookups this
requires internal memory data structures that:
– Identify blocks that have overflown.
– Facilitate single I/O lookup of the elements hashing to these blocks.
The first task can be solved by maintaining a dictionary of overflowing blocks.
The probability of a block overflowing is O(2^{−c(α)B}), so we expect to store
the indices of O(2^{−c(α)B}N) blocks. This requires O(2^{−c(α)B}N log N) bits of
internal space. If we simply discard the external memory blocks that have
overflown, the second task can be solved recursively by a dictionary sup-
porting single I/O lookups, storing a set that with high probability has size
O(2^{−c(α)B}N).
Perfect hashing. Mairson [525] considered implementing a B-perfect hash
function p : K → {1, . . . , N/B} that maps at most B keys to each block.
Note that if we store key k in block p(k) and the B-perfect hash function
resides in internal memory, we need only a single I/O to look up k. Mairson
showed that such a function can be implemented using O(N log(B)/B) bits
of internal memory. (In the interest of simplicity, we ignore an extra term
that only shows up when the key set K has size 2^{B^{ω(N)}}.)
external blocks is only N/B and we want to be able to handle every possible
key set, this is also the best possible [525]. Unfortunately, the time and space
needed to evaluate Mairson’s hash functions is extremely high, and it seems
very difficult to obtain a dynamic version. The rest of this section deals with
more practical ways of implementing (dynamic) B-perfect hashing.
Extendible Hashing. A popular B-perfect hashing method that comes
close to Mairson’s bound is extendible hashing by Fagin et al. [285]. The
expected space utilization in external memory is about 69% rather than the
100% achieved by Mairson’s scheme.
Extendible hashing employs an internal structure called a directory to
determine which external block to search. The directory is an array of 2^d
pointers to external memory blocks, for some parameter d. Let h : K →
{0, 1}^r be a truly random hash function, where r ≥ d. Lookup of a key k is
performed by using h_d(k), the function returning the d least significant bits
of h(k), to determine an entry in the directory, which in turn specifies the
external block to be searched. The parameter d is chosen to be the smallest
number for which at most B dictionary keys map to the same value under
h_d. If r ≥ 3 log N, say, such a d exists with high probability. In case it
does not we simply rehash. Many pointers in the directory may point to the
same block. Specifically, if no more than B dictionary keys map to the same
value v under h_{d'}, for some d' < d, all directory entries with indices having
v in their d' least significant bits point to the same external memory block.
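The lookup itself takes only a few instructions in internal memory. A minimal sketch (Python; directory is the array of 2^d block references and h returns an r-bit integer):

def ext_hash_lookup(directory, blocks, h, d, k):
    # Extendible hashing lookup: exactly one external I/O (sketch).
    entry = h(k) & ((1 << d) - 1)  # h_d(k): the d least significant bits
    block_id = directory[entry]    # internal memory access
    return k in blocks[block_id]   # the single external memory access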
Clearly, extendible hashing provides lookups using a single I/O and con-
stant internal processing time. Analyzing its space usage is beyond the scope
of this chapter, but we mention some results. Flajolet [305] has shown that
the expected number of entries in the directory is approximately (4/B) · N^{1+1/B}. If
B is just moderately large, this is close to optimal, e.g., in case B ≥ log N
the number of bits used is less than 8N log(N )/B. In comparison, the opti-
mal space bound for perfect hashing to exactly N/B external memory blocks
is (1/2) · N log(B)/B + Θ(N/B) bits. The expected external space usage can be
shown to be around N/(B ln 2) blocks, which means that about 69% of the
space is utilized [285, 545].
Extendible hashing is named after the way in which it adapts to changes
of the key set. The level of a block is the largest d' ≤ d for which all its keys
map to the same value under h_{d'}. Whenever a block at level d' has run full,
it is split into two blocks at level d' + 1 using h_{d'+1}. In case d' = d we first
need to double the size of the directory. Conversely, if two blocks at level d',
with keys having the same function value under h_{d'−1}, contain less than B
keys in total, these blocks are merged. If no blocks are left at level d, the size
of the directory is halved.
Using a Predecessor Dictionary. If one is willing to increase internal
computation from a constant to expected O(log log N ) time per dictionary
operation, both internal and external space usage can be made better than
that of extendible hashing. The idea is to replace the directory with a dic-
tionary supporting predecessor queries in a key set P ⊆ {0, 1}^r: For any
x ∈ {0, 1}^r it reports the largest key y ∈ P such that y ≤ x, along with some
information associated with this key. In our application the set P will be the
hash values of a small subset of the set of keys in the dictionary.
We will keep the keys of the dictionary stored in a linked list, sorted
according to their hash values (interpreted as nonnegative integers). For each
block in the linked list we keep the smallest hash value in the predecessor
dictionary, and associate with it a pointer to the block. This means that
lookup of x ∈ K can be done by searching the block pointed to by the
predecessor of h(x). Insertions and deletions can be done by inserting or
deleting the key in the linked list, and possibly making a constant number of
updates to the predecessor dictionary.
We saw in Problem 2.6 that a linked list with space utilization 1 − ε can
be maintained in O(1) I/Os per update, for any constant ε > 0. The internal
predecessor data structure then contains at most N/((1 − ε)B) keys. We
choose the range of the hash function such that 3 log N ≤ r = O(log N ).
Since the hash function values are only O(log N ) bits long, one can imple-
ment a very efficient linear space predecessor dictionary based on van Emde
Boas trees [745, 746]. This data structure [765] allows predecessor queries
to be answered in time O(log log N ), and updates to be made in expected
O(log log N ) time. The space usage is linear in the number of elements stored.
In conclusion, we get a dictionary supporting updates in O(1) I/Os and
time O(log log N ), expected, and lookups in 1 I/O and time O(log log N ).
For most practical purposes the internal processing time is negligible. The
external space usage can be made arbitrarily close to optimal, and the internal
space usage is O(N/B).
2.4.3 Lookup Using Two Parallel External Memory Accesses
We now consider a scenario in which we may perform two I/Os in parallel, in
two separate parts of external memory. This is realistic, for example, if two
disks are available or when RAM is divided into independent banks. It turns
out that, with high probability, all dictionary operations can be performed
accessing just a single block in each part of memory, assuming that the load
factor α is bounded away from 1 and that blocks are not too small.
The hashing scheme achieving this is called two-way chaining, and was
introduced by Azar et al. [77]. It can be thought of as two chained hashing
data structures with pseudorandom hash functions h1 and h2 . Key x may
reside in either block h1 (x) of hash table one or block h2 (x) of hash table
two. New keys are always inserted in the block having the smallest number of
keys, with ties broken such that keys go to table one (the advantages of this
tie-breaking rule were discovered by Vöcking [756]). It can be shown that the
probability of an insertion causing an overflow is N/2^{2^{Ω((1−α)B)}} [115]. That
is, the failure probability decreases doubly exponentially with the average
number of free spaces in each block. The constant factor in the Ω is larger than
1, and it has been shown experimentally that even for very small amounts of
free space in each block, the probability of an overflow (causing a rehash) is
very small [159]. The effect of deletions in two-way chaining does not appear
to have been analyzed.
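A minimal sketch of the insertion rule (Python; rehashing on overflow is not shown):

def twoway_insert(table1, table2, h1, h2, x):
    # Insert into the less loaded of the two candidate blocks (sketch).
    b1, b2 = table1[h1(x)], table2[h2(x)]
    if len(b1) <= len(b2):   # ties go to table one (Voecking's rule)
        b1.append(x)
    else:
        b2.append(x)

def twoway_lookup(table1, table2, h1, h2, x):
    # The two blocks can be probed in parallel, one I/O in each memory part.
    return x in table1[h1(x)] or x in table2[h2(x)]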
2.4.4 Resizing Hash Tables
In the above we several times assumed that the load factor of our hash table
is at most some constant α < 1. Of course, to keep the load factor below
α we may have to increase the size of the hash table employed when the
size of the set increases. On the other hand we wish to keep α above a
certain threshold to have a good external memory utilization, so shrinking
the hash table is also occasionally necessary. The challenge is to rehash to
the new table without having to do an expensive reorganization of the old
hash table. Simply choosing a new hash function would require a random
permutation of the keys, a task shown in [17] to require Θ((N/B) log_{M/B}(N/B))
I/Os. When N = (M/B)^{O(B)}, i.e., when N is not extremely large, this is O(N) I/Os.
Since one usually has Θ(N ) updates between two rehashes, the reorganization
cost can be amortized over the cost of updates. However, more efficient ways
of reorganizing the hash table are important in practice to keep constant
factors down. The basic idea is to introduce more “gentle” ways of changing
the hash function.
Linear Hashing. Litwin [508] proposed a way of gradually increasing and
decreasing the range of hash functions with the size of the set. The basic
idea for hashing to a range of size r is to extract b = ⌈log2 r⌉ bits from a
“mother” hash function. If the extracted bits encode an integer k less than r,
this is used as the hash value. Otherwise the hash function value k − 2^{b−1} is
returned. When expanding the size of the hash table by one block (increasing
r by one), all keys that may hash to the new block r + 1 previously hashed to
block r + 1 − 2^{b−1} . This makes it easy to update the hash table. Decreasing
the size of the hash table is done in a symmetric manner.
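The address computation fits in a few lines. The following Python sketch
assumes a mother hash value phi_x that supplies enough bits and a table
size r ≥ 2; the function name is ours.

    import math

    # Block index under linear hashing for current table size r,
    # computed from the "mother" hash value phi_x.
    def linear_hash(phi_x, r):
        b = math.ceil(math.log2(r))   # number of bits to extract
        k = phi_x % (2 ** b)          # the b extracted bits as an integer
        return k if k < r else k - 2 ** (b - 1)

    # Example: for r = 6 we extract b = 3 bits; the out-of-range values
    # k = 6 and k = 7 map to blocks 2 and 3, respectively.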
The main problem with linear hashing is that when r is not a power
of 2, the keys are not mapped uniformly to the range. For example, if r is
1.5 times a power of two, the expected number of collisions between keys is
12.5% higher than that expected for a uniform hash function. Even worse,
the expected maximum number of keys hashing to a single bucket can be up
to twice as high as in the uniform case. Some attempts have been made to
alleviate these problems, but all have the property that the hash functions
used are not completely uniform, see [497] and the references therein. Another
problem lies in the analysis, which for many hashing schemes is complicated
by nonuniform hash functions. Below we look at a way of doing efficient
rehashing in a uniform way.
Uniform Rehashing. We now describe an alternative to linear hashing
that yields uniform hash functions [597]. To achieve both uniformity and
efficient rehashing we do not allow the hash table size to increase/decrease
in increments of 1, but rather support that its size is increased/decreased by
a factor of around 1 + ε for some ε > 0. This means that we are not able
to control exactly the relative sizes of the set and hash table. On the other
hand, uniformity means that we will be able to achieve the performance of
linear hashing using a smaller hash table.
As in linear hashing we extract the hash function value for all ranges
from a “mother” hash function φ : U → {0, . . . , R − 1}. The factor between
consecutive hash table sizes will be between 1 + ε1 and 1 + ε2 , where ε2 > ε1 > 0
are arbitrary constants. The size R of the range of φ is chosen as follows. Take
a sequence of positive integers i_1 , . . . , i_k such that i_k = 2^p · i_1 for some positive
integer p, and 1 + ε1 < i_{j+1} /i_j < 1 + ε2 for j = 1, . . . , k − 1.
Exercise 2.20. Show that i_1 , . . . , i_k can be chosen to satisfy the above re-
quirements, and such that I = Σ_{j=1}^{k} i_j is a constant (depending only
on ε1 and ε2 ).
We let R = 2^b · I, where I is defined in Exercise 2.20 and b is chosen such
that no hash function with range larger than 2^b i_k will be needed. Whenever
r divides R we have the uniformly random hash function with range of size r:
h_r (x) = φ(x) div (R/r), where div denotes integer division. The possible range
sizes are 2^q i_j for q = 0, . . . , b, j = 1, . . . , k. If the current range size is r = 2^q i_j
and we wish to hash to a larger table, we choose new range r′ = 2^q i_{j+1} if
j < k and r′ = 2^{q+p} i_2 if j = k. By the way we have chosen i_1 , . . . , i_k it holds
that 1 + ε1 < r′ /r < 1 + ε2 . The case where we wish to hash to a smaller
table is symmetric.
The following property of our hash functions means that, for many hash-
ing schemes, rehashing can be performed by a single scan through the hash
table (i.e., in O(N/B) I/Os): If φ(x) ≤ φ(y) then h_r (x) ≤ h_r (y) for any r.
In other words, our hash functions never change the ordering of hash values
given by φ.
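A minimal Python sketch of the resulting family of hash functions follows;
the function name and the example schedule are ours, and the monotonicity
noted above is visible directly in the integer division.

    # Uniform rehashing: all tables use hash values derived from one
    # mother value phi_x in {0, ..., R-1}.
    def h(phi_x, r, R):
        """Uniform hash value in {0, ..., r-1}, for any r dividing R."""
        assert R % r == 0
        return phi_x // (R // r)   # nondecreasing in phi_x, for every r

    # Example schedule with (i_1, i_2, i_3) = (2, 3, 4): i_3 = 2 * i_1,
    # so p = 1 and I = 9, giving R = 2^b * 9. Growing from range
    # r = 2^q * 4 continues at r' = 2^(q+1) * 3, a size factor of 1.5.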
2.5 Dynamization Techniques
This section presents two general techniques for obtaining dynamic data
structures for sets.
2.5.1 The Logarithmic Method
In many cases it is considerably simpler to come up with an efficient way of
constructing a static data structure than achieving a correspondingly efficient
dynamic data structure. The logarithmic method is a technique for obtaining
data structures with efficient (though often not optimal) insertion and query
operations in some of these cases. More specifically, the problem must be
decomposable: If we split the set S of elements into disjoint subsets S1 , . . . , Sk
and create a (static) data structure for each of them, then queries on the whole
set can be answered by querying each of these data structures. Examples of
decomposable problems are dictionaries and priority queues.
The basic idea of the logarithmic method is to maintain a collection
of data structures of different sizes, and periodically merge a number of data
structures into one, in order to keep the number of data structures to be
queried low. In internal memory, the number of data structures for a set of
size N is typically O(log N ), explaining the name of the method. We refer
to [109, 110, 600] and the references therein for more background.
In the external memory version of the logarithmic method that we de-
scribe [69], the number of data structures used is decreased to O(logB N ).
Insertions are done by rebuilding the first static data structure such that it
contains the new element. The invariant is that the ith data structure should
have size no more than B^i . If this size is reached, it is merged with the (i + 1)st
data structure (which may be empty). Merging is done by rebuilding a static
data structure containing all the elements of the two data structures.
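The following Python sketch illustrates the method, with sorted lists standing
in for the static data structures; it is a minimal illustration of the invariant,
not an I/O-efficient implementation, and the class name is ours.

    from bisect import bisect_left

    class LogarithmicMethod:
        """Sketch: level i holds a static structure (here, a sorted
        list) of at most B**(i+1) elements, or nothing."""
        def __init__(self, B):
            self.B = B
            self.levels = []              # levels[i]: sorted list or None

        def insert(self, x):
            carry, i = [x], 0
            while True:
                if i == len(self.levels):
                    self.levels.append(None)
                if self.levels[i] is not None:
                    carry = sorted(carry + self.levels[i])   # rebuild / merge
                    self.levels[i] = None
                if len(carry) <= self.B ** (i + 1):
                    self.levels[i] = carry                    # fits at level i
                    return
                i += 1                    # size bound reached: merge upward

        def member(self, x):
            # Decomposability: query every level and combine the answers.
            for lvl in self.levels:
                if lvl:
                    j = bisect_left(lvl, x)
                    if j < len(lvl) and lvl[j] == x:
                        return True
            return False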
Exercise 2.21. Show that when inserting N elements, each element will be
part of a rebuilding O(B logB N ) times.
Suppose that building a static data structure for N elements uses
O((N/B)(logB N )^k ) I/Os. Then by the exercise, the total amortized cost of in-
serting an element is O((logB N )^{k+1} ) I/Os. Queries take O(logB N ) times more
I/Os than queries in the corresponding static data structures.
2.5.2 Global Rebuilding
Some data structures for sets support deletions, but do not recover the space
occupied by deleted elements. For example, deletions in a static dictionary can
be done by marking deleted elements (this is called a weak delete). A general
technique for keeping the number of deleted elements at some fraction of
the total number of elements is global rebuilding: In a data structure of N
elements (present and deleted), whenever αN elements have been deleted,
for some constant α > 0, the entire data structure is rebuilt. The cost of
rebuilding is at most a constant factor higher than the cost of inserting αN
elements, so the amortized cost of global rebuilding can be charged to the
insertions of the deleted elements.
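A minimal sketch of this wrapper, assuming a constructor `build` for the
underlying static dictionary (the class name and the default value of α are
ours):

    class GlobalRebuilding:
        """Global rebuilding around a static dictionary that supports
        weak deletes only."""
        def __init__(self, build, alpha=0.5):
            self.build, self.alpha = build, alpha
            self.live = set()             # elements not yet deleted
            self.deleted = 0              # number of weak-deleted elements
            self.static = build(self.live)

        def weak_delete(self, x):
            self.live.discard(x)
            self.deleted += 1
            total = len(self.live) + self.deleted
            # Rebuild once an alpha-fraction of all elements is deleted.
            if self.deleted >= self.alpha * total:
                self.static = self.build(self.live)
                self.deleted = 0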
Exercise 2.22. Discuss pros and cons of using global rebuilding for B-trees
instead of the deletion method described in Section 2.3.2.
2.6 Summary
This chapter has surveyed some of the most important external memory data
structures for sets and lists: Elementary abstract data structures (queues,
stacks, linked lists), B-trees, buffer trees (including their use for priority
queues), and hashing based dictionaries. Along the way, several important
design principles for memory hierarchy aware algorithms and data structures
have been touched upon: Using buffers, blocking and locality, making use
of internal memory, output sensitivity, data structures for batched dynamic
problems, the logarithmic method, and global rebuilding. In the following
chapters of this volume, the reader who wants to know more can find a wealth
of information on virtually all aspects of algorithms and data structures for
memory hierarchies.
Since the data structure problems discussed in this chapter are fundamen-
tal they are well-studied. Some problems have resisted the efforts of achiev-
ing external memory results “equally good” as the corresponding internal
memory results. In particular, the problems of supporting fast insertion and
decrease-key in priority queues (or show that this is not possible) have re-
mained challenging open research problems.
Acknowledgements. The surveys by Arge [55], Enbody and Du [280], and
Vitter [753, 754] were a big help in writing this chapter. I would also like to
acknowledge the help of Gerth Stølting Brodal, Ulrich Meyer, Anna Östlin,
Jan Vahrenhold, Berthold Vöcking, and last but not least the participants of
the GI-Dagstuhl-Forschungsseminar “Algorithms for Memory Hierarchies”.
3. A Survey of Techniques for Designing
I/O-Efficient Algorithms∗
Anil Maheshwari and Norbert Zeh
3.1 Introduction
This survey is meant to give an introduction to elementary techniques used for
designing I/O-efficient algorithms. We do not intend to give a complete survey
of all state-of-the-art techniques; but rather we aim to provide the reader with
a good understanding of the most elementary techniques. Our focus is on
general techniques and on techniques used in the design of I/O-efficient graph
algorithms. We include the latter because many abstract data structuring
problems can be translated into classical graph problems. While this fact
is of mostly philosophical interest in general, it gains importance in I/O-
efficient algorithms because random access is penalized in external memory
algorithms and standard techniques to extract information from graphs can
help when trying to extract information from pointer-based data structures.
For the analysis of the I/O-complexity of the algorithms, we adopt the
Parallel Disk Model (PDM) (see Chapter 1) as the model of computation.
We restrict our discussion to the single-disk case (D = 1) and refer the reader
to appropriate references for the case of multiple disks. In order to improve
the readability of the text, we do not worry too much about the integrality of
parameters that arise in the discussion. That is, we write x/y to denote ⌈x/y⌉
or ⌊x/y⌋, as appropriate. The same applies to expressions such as log x, √x,
etc.
We begin our discussion in Section 3.2 with an introduction to two general
techniques that are applied in virtually all I/O-efficient algorithms: sorting
and scanning. In Section 3.3 we describe a general technique to derive I/O-
efficient algorithms from efficient parallel algorithms. Using this technique,
the huge repository of efficient parallel algorithms can be exploited to obtain
I/O-efficient algorithms for a wide range of problems. Sections 3.4 through 3.7
are dedicated to the discussion of techniques used in I/O-efficient algorithms
for fundamental graph problems. The choice of the graph problems we con-
sider is based on the importance of these problems as tools for solving other
problems that are not of a graph-theoretic nature.
∗ Research supported by Natural Sciences and Engineering Research Council of
Canada. Part of this work was done while the second author was a Ph.D. student
at the School of Computer Science of Carleton University.
3.2 Basic Techniques
3.2.1 Scanning
Scanning is the simplest of all paradigms applied in I/O-efficient algorithms.
The idea is that reading and writing data in sequential order is less expensive
than accessing data at random. In particular, N data items, when read in se-
quential order, can be accessed in O(N/B) I/Os, while accessing N data items
at random costs Ω(N ) I/Os in the worst case. We illustrate this paradigm
using a simple example and then derive the general, formal definition of the
paradigm from the discussion of the example.
The example we consider is that of computing the prefix sums of the
elements stored in an array A and storing them in an array A′. The straight-
forward algorithm for solving this problem accesses the elements of A one by
one in their order of appearance and adds them to a running sum s, which is
initially set to 0. After reading each element and adding it to s, the current
value of s is appended to array A′.
In internal memory, this simple algorithm takes linear time. It scans ar-
ray A, i.e., reads the elements of A in their order of appearance and writes
the elements of array A′ in sequential order. Whenever data is accessed in se-
quential order like this, we speak of an application of the scanning paradigm.
Looking at it from a different angle, we can consider array A as an input
stream whose elements are processed by the algorithm as they arrive, and
array A′ is an output stream produced by the algorithm.
In external memory, the algorithm takes O(N/B) I/Os after making the
following simple modifications: At the beginning of the algorithm, instead of
accessing only the first element of A, the first B elements are read into an
input buffer associated with input stream A. This transfer of the first B el-
ements from disk into main memory takes a single I/O. After that, instead
of reading the next element to be processed directly from input stream A,
they are read from the input buffer, which does not incur any I/Os. As soon
as all elements in the input buffer have been processed, the next B elements
are read from the input stream, which takes another I/O, and then the el-
ements to be processed are again retrieved from the input buffer. Applying
this strategy until all elements of A are processed, the algorithm performs
N/B I/O-operations to read its input elements from input stream A. The
writing of the output can be “blocked” in a similar fashion: That is, we as-
sociate an output buffer of size B with output stream A′. Instead of writing
the elements of A′ directly to disk as they are produced, we append these
elements to the output buffer until the buffer is full. When this happens,
we write the content of the buffer to disk, which takes one I/O. Then there
is room in the buffer for the next B elements to be appended to A′. Re-
peating this process until all elements of A′ have been written to disk, the
algorithm performs N/B I/Os to write A′. Thus, in total the computation of
Fig. 3.1. Merging two sorted sequences. (a) The initial situation: The two lists are
stored on disk. Two empty input buffers and an empty output buffer have been
allocated in main memory. The output sequence does not contain any data yet.
(b) The first block from each input sequence has been loaded into main memory.
(c) The first B elements have been moved from the input buffers to the output
buffer.
Fig. 3.1. (continued) (d) The contents of the output buffer are flushed to the
output stream to make room for more data to be moved to the output buffer.
(e) After moving elements 5, 7, and 8 to the output buffer, the input buffer for the
first stream does not contain any more data items. Hence, the next block is read
from the first input stream into the input buffer.
array A′ from array A takes O(N/B) I/Os rather than Θ(N ) I/Os, as would
be required to solve this task using direct disk accesses.
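The following Python sketch mirrors this blocked prefix-sums computation;
list slicing and extending stand in for one read I/O and one write I/O,
respectively, and the function name is ours.

    # Blocked prefix sums: O(N/B) simulated I/Os in total.
    def prefix_sums(disk_in, disk_out, B):
        s, out_buf = 0, []
        for start in range(0, len(disk_in), B):
            in_buf = disk_in[start:start + B]   # one read I/O: next B items
            for x in in_buf:
                s += x
                out_buf.append(s)
                if len(out_buf) == B:
                    disk_out.extend(out_buf)    # one write I/O: flush buffer
                    out_buf = []
        if out_buf:
            disk_out.extend(out_buf)            # final, possibly partial block

    # Usage: A, Aprime = [1, 2, 3, 4, 5], []
    #        prefix_sums(A, Aprime, B=2)  ->  Aprime == [1, 3, 6, 10, 15]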
In our example we apply the scanning paradigm to a problem with one
input stream A and one output stream A′. It is easy to apply the above
buffering technique to a problem with q input streams S1 , . . . , Sq and r out-
put streams S′1 , . . . , S′r , as long as there is enough room to keep an input
buffer of size B per input stream Si and an output buffer of size B per
output stream S′j in internal memory. More precisely, q + r cannot be more
than M/B. Under this assumption the algorithm still takes O(N/B) I/Os,
where N = Σ_{i=1}^{q} |Si | + Σ_{j=1}^{r} |S′j |. Note, however, that this analysis includes
only the number of I/Os required to read the elements from the input streams
and write the output to the output streams. It does not include the I/O-
complexity of the actual computation of the output elements from the in-
put elements. One way to guarantee that the I/O-complexity of the whole
algorithm, including all computation, is O(N/B) is to ensure that only the
M −(q +r)B elements most recently read from the input streams are required
for the computation of the next output element, or the required information
about all elements read from the input streams can be maintained succinctly
in M − (q + r)B space. If this can be guaranteed, the computation of all
output elements from the read input elements can be carried out in main
memory and thus does not cause any I/O-operations to be performed.
An important example where the scanning paradigm is applied to more
than one input stream is the merging of k sorted streams to produce a single
sorted output stream (see Fig. 3.1). This procedure is applied repeatedly with
a parameter of k = 2 in the classical internal memory MergeSort algorithm.
The I/O-efficient MergeSort algorithm discussed in the next section takes
advantage of the fact that up to k = M/B streams can be merged in a linear
number of I/Os, in order to decrease the number of recursive merge steps
required to produce a single sorted output stream.
3.2.2 Sorting
Sorting is a fundamental problem that arises in almost all areas of computer
science, including large scale applications such as database systems. Besides
the obvious applications where the task at hand requires by definition that
the output be produced in sorted order, sorting is often applied as a paradigm
for eliminating random disk accesses in external memory algorithms. Consider
for instance a graph algorithm that performs a traversal of the input graph
to compute a labelling of its vertices. This task does not require any part
of the representation of the graph to be sorted, so that the internal memory
algorithm does not necessarily include a sorting step. An external memory
algorithm for the same problem, on the other hand, can benefit greatly by
sorting the vertex set of the graph appropriately. In particular, without sort-
ing the vertex set, the algorithm has no control over the order in which the
vertices are stored on disk. Hence, in the worst case the algorithm spends one
I/O to load each vertex into internal memory when it is visited. If the order
in which the vertices of the graph are visited can be determined efficiently,
it is more efficient to sort the vertices in this order and then perform the
traversal of the graph using a simple scan of the sorted vertex set.
The number of I/O-efficient sorting algorithms that have been proposed in
the literature is too big for us to give an exhaustive survey of these algorithms
at the level of detail appropriate for a tutorial such as this one. Hence, we restrict
our discussion to a short description of the two main paradigms applied in
I/O-efficient sorting algorithms and then present the simplest I/O-optimal
sorting algorithm in detail. At the end of this section we discuss a number of
issues that need to be addressed in order to obtain I/O-optimal algorithms
for sorting on multiple disks.
Besides algorithms that delegate the actual work required to sort the
given set of data elements to I/O-efficient data structures, the existing sorting
algorithms can be divided into two categories based on the basic approach
taken to produce the sorted output.
Algorithms based on the merging paradigm proceed in two phases. In
the first phase, the run formation phase, the input data is partitioned into
more or less trivial sorted sequences, called “runs”. In the second phase,
the merging phase, these runs are merged until only one sorted run remains,
where merging k runs S1 , . . . , Sk means that a single sorted run S is produced
that contains all elements of runs S1 , . . . , Sk . The classical internal memory
MergeSort algorithm is probably the simplest representative of this class
of algorithms. The run formation phase is trivial in this case, as it simply
declares every input element to be in a run of its own. The merging phase
uses two-way merging to produce longer runs from shorter ones. That is,
given the current set of runs S1 , . . . , Sk , these runs are grouped into pairs of
runs and each pair is merged to form a longer run. Each such iteration of
merging pairs of runs can be carried out in linear time. The number of runs
reduces by a factor of two from one iteration to the next, so that O(log N )
iterations suffice to produce a single sorted run containing all data elements.
Hence, the algorithm takes O(N log N ) time.
Algorithms based on the distribution paradigm compute a partition of the
given data set S into subsets S0 , . . . , Sk so that for all 0 ≤ i < j ≤ k and
any two elements x ∈ Si and y ∈ Sj , x ≤ y. Given this partition, a sequence
containing the elements of S in sorted order is produced by sorting each of the
sets S0 , . . . , Sk recursively and concatenating the resulting sorted sequences.
In order to produce sets S0 , . . . , Sk , the algorithm chooses a set of splitters
x1 ≤ · · · ≤ xk from S. To simplify the discussion, we also assume that there
are two splitters x0 = −∞ and xk+1 = +∞. Then set Si , 0 ≤ i ≤ k, is
defined as the set of elements y ∈ S so that xi ≤ y < xi+1 . Given the set
of splitters, sets S0 , . . . , Sk are produced by comparing each element in S to
the splitters. The efficiency of this procedure depends on a good choice of the
splitter elements. If it can be guaranteed that each of the sets S0 , . . . , Sk has
size O(|S|/k), the procedure finishes in O(N log2 N ) time because O(logk N )
levels of recursion suffice to produce a partition of the input into N singleton
sets that are arranged in the right order, and each level of recursion takes
O(N log k) time to compare each element in S to the splitters. QuickSort
is a representative of this class of algorithms.
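The partitioning step of the distribution paradigm takes only a few lines in
Python; external memory versions additionally keep an output buffer of size
B per bucket. The function name is ours.

    import bisect

    # One distribution step: split S into buckets S_0, ..., S_k using
    # splitters x_1 <= ... <= x_k (x_0 = -inf and x_{k+1} = +inf are
    # implicit). Bucket i receives the elements y with x_i <= y < x_{i+1}.
    def distribute(S, splitters):
        buckets = [[] for _ in range(len(splitters) + 1)]
        for y in S:
            i = bisect.bisect_right(splitters, y)   # O(log k) comparisons
            buckets[i].append(y)
        return buckets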
In internal memory MergeSort has the appealing property of being
simpler than QuickSort because the run formation phase is trivial and the
merging phase is a simple iterative process that requires only little book-
keeping. In contrast, QuickSort faces the problem of computing a good
splitter, which is easy using randomization, but requires some effort if done
deterministically. In external memory the situation is not much different, as
long as the goal is to sort optimally on a single disk. If I/O-optimal perfor-
mance is to be achieved on multiple disks, distribution-based algorithms are
preferable because it is somewhat easier to make them take full advantage of
the parallel nature of multi-disk systems. Since we do not go into too much
detail discussing the issues involved in optimal sorting on multiple disks, we
choose an I/O-efficient version of MergeSort as the sorting algorithm we
discuss in detail. The algorithm is simple, resembles very much the internal
memory algorithm, and achieves optimal performance on a single disk.
In order to see what needs to be done to obtain a MergeSort algo-
rithm that performs O((N/B) logM/B (N/B)) I/Os, let us analyze the I/O-
complexity of the internal memory MergeSort algorithm as it is. The run
formation phase does not require any I/Os. Merging two sorted runs S1 and
S2 takes O(1 + (|S1 | + |S2 |)/B) I/Os using the scanning paradigm: Read the
first element from each of the two streams. Let x and y be the two read
elements. If x < y, place x into the output stream, read the next element
from S1 , and repeat the whole procedure. If x ≥ y, place y into the out-
put stream, read the next element from S2 , and repeat the whole procedure.
Since every input element is involved in O(log N ) merge steps, and the total
number of merge steps is O(N ), the I/O-complexity of the internal memory
MergeSort algorithm is O(N + (N/B) log2 N ).
The I/O-complexity of the algorithm can be reduced to O((N/B)·
log2 (N/B)) by investing O(N/B) I/Os during the run formation phase. In
particular, the run formation phase makes sure that the merge phase starts
out with N/M sorted runs of length M instead of N singleton runs. To
achieve this goal, the data is partitioned into N/M chunks of size M . Then
each chunk is loaded into main memory, sorted internally, and written back to
disk in sorted order. Reading and writing each chunk takes O(M/B) I/Os, so
that the total I/O-complexity of the run formation phase is O(N/M ·M/B) =
O(N/B). As a result of reducing the number of runs to N/M , the merge phase
now takes O(N/M + (N/B) log2 (N/M )) = O((N/B) log2 (N/B)) I/Os.
In order to increase the base of the logarithm to M/B, it has to be ensured
that the number of runs reduces by a factor of Ω(M/B) from one iteration of
the merge phase to the next because then O(logM/B (N/B)) iterations suffice
to produce a single sorted run. In order to achieve this goal, the obvious
thing to do is to merge k = M/(2B) runs S1 , . . . , Sk instead of only two
runs in a single merge step. Similar to the internal memory merge step, the
algorithm loads the first elements x1 , . . . , xk from runs S1 , . . . , Sk into main
memory, copies the smallest of them, xi , to the output run, reads the next
element from Si and repeats the whole procedure. The I/O-complexity of this
modified merge step is O(k + (Σ_{i=1}^{k} |Si |)/B) because the available amount of
main memory suffices to allocate a buffer of size B for each input and output
run, so that the scanning paradigm can be applied.
Using this modified merge step, the number of runs reduces by a factor of
M/(2B) from one iteration of the merge phase to the next. Hence, the algo-
rithm produces a single sorted run already after O(logM/B (N/B)) iterations,
and the I/O-complexity of the algorithm becomes O((N/B) logM/B (N/B)),
as desired.
An issue that does not affect the I/O-complexity of the algorithm, but is
important to obtain a fast algorithm, is how the merging of k runs instead
of two runs is done in internal memory. When merging two runs, choosing
the next element to be moved to the output run involves a single compar-
ison. When merging k > 2 runs, it becomes computationally too expensive
to find the minimum of elements x1 , . . . , xk in O(k) time because then the
running time of the merge phase would be O(kN logk (N/B)). In order to
achieve optimal running time in internal memory as well, the minimum of
elements x1 , . . . , xk has to be found in O(log k) time. This can be achieved by
maintaining the smallest elements x1 , . . . , xk , one from each run, in a priority
queue. The next element to be moved to the output run is the smallest in the
priority queue and can hence be retrieved using a DeleteMin operation. Let
the retrieved element be xi ∈ Si . Then after moving xi to the output run,
the next element is read from Si and inserted into the priority queue, which
guarantees that again the smallest unprocessed element from every run is
stored in the priority queue. This process is repeated until all elements have
been moved to the output run. The amount of space used by the priority
queue is O(k) = O(M/B), so that the priority queue can be maintained in
main memory. Moving one element to the output run involves the execution
of one DeleteMin and one Insert operation on the priority queue, which
takes O(log k) time. Hence, the total running time of the MergeSort al-
gorithm is O(N log M + (N log k) logk (N/B)) = O(N log N ). We summarize
the discussion in the following theorem.
Theorem 3.1. [17] A set of N elements can be sorted using
O((N/B) logM/B (N/B)) I/Os and O(N log N ) internal memory computa-
tion time.
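The k-way merge step just described can be rendered compactly in Python;
heapq plays the role of the internal priority queue, and plain iterators stand
in for the buffered runs on disk. The function name is ours, and the runs are
assumed not to contain the value None.

    import heapq

    # Merge k sorted runs into one sorted run. With k <= M/(2B) and one
    # buffer per run, this is the merge step of the external MergeSort.
    def merge_runs(runs):
        heap = []
        for i, run in enumerate(runs):
            it = iter(run)
            first = next(it, None)
            if first is not None:
                heap.append((first, i, it))
        heapq.heapify(heap)                # front element of every run
        out = []
        while heap:
            x, i, it = heapq.heappop(heap) # DeleteMin: next output element
            out.append(x)
            nxt = next(it, None)
            if nxt is not None:
                heapq.heappush(heap, (nxt, i, it))  # Insert successor from run i
        return out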
Sorting optimally on multiple disks is a more challenging problem. The
challenge with distribution-based algorithms is to distribute the blocks of
the buckets approximately evenly across the D disks while at the same time
making sure that every I/O-operation writes Ω(D) blocks to disk. The lat-
ter requirement needs to be satisfied to guarantee that the algorithm takes
full advantage of the parallel disks (up to a constant factor) during write
operations. The former requirement is necessary to guarantee that the re-
cursive invocation of the sorting algorithm can utilize the full bandwidth of
the parallel I/O-system when reading the data elements to be sorted. Vitter
and Shriver [755] propose randomized online techniques that guarantee that
with high probability each bucket is distributed evenly across the D disks.
The balancing algorithm applies the classical result that if α balls are placed
uniformly at random into β bins and α = Ω(β log β), then all bins contain ap-
proximately the same number of balls, with high probability. The balancing
algorithm maps the blocks of each bucket uniformly at random to disks where
they are to be stored. Viewing the blocks in a bucket as balls and the disks
as bins, and assuming that the number of blocks is sufficiently larger than
the number of disks, the blocks in each bucket are distributed evenly across
the D disks, with high probability. A number of deterministic methods for
performing distribution sort on multiple disks have also been proposed, in-
cluding BalanceSort [586], sorting using the buffer tree [52], and algorithms
obtained by simulating bulk-synchronous parallel sorting algorithms [244].
The reader may refer to these references for details.
For merge sort, it is required that each iteration in the merging phase
is carried out in O(N/(DB)) I/Os. In particular, each read operation must
bring Ω(D) blocks of data into main memory, and each write operation must
write Ω(D) blocks to disk. While the latter is easy to achieve, reading blocks
in parallel is difficult because the runs to be merged were formed in the
previous iteration without any knowledge about how they would interact with
other runs in subsequent merge operations. Nodine and Vitter [587] propose
an optimal deterministic merge sort for multiple disks. The algorithm first
performs an approximate merge phase that guarantees that no element is too
far away from its final location. In the second phase, each element is moved
to its final location. Barve et al. [92, 93] claim that their sorting algorithm
is the most practical one. Using their approach, each run is striped across
the disks, with a random starting disk. When merging runs, the next block
needed from each disk is read into main memory. If there is not sufficient
room in main memory for all the blocks to be read, then the least needed
blocks are discarded from main memory (without incurring any I/Os). They
derive asymptotic upper bounds on the expected I/O complexity of their
algorithm.
3.3 Simulation of Parallel Algorithms in External
Memory
Blockwise data access is a central theme in the design of I/O-efficient algo-
rithms. A second important issue, when more than one disk is present, is fully
parallel disk I/O. A number of techniques have been proposed that address
this issue by simulating parallel algorithms as external memory algorithms.
Most notably, Atallah and Tsay [74] discuss how to derive I/O-efficient al-
gorithms from parallel algorithms for mesh architectures, Chiang et al. [192]
discuss how to obtain I/O-efficient algorithms from PRAM algorithms, and
Sibeyn and Kaufmann [692], Dehne et al. [242, 244], and Dittrich et al. [254]
discuss how to simulate coarse-grained parallel algorithms developed for the
BSP, CGM, and BSP∗ models in external memory. In this section we discuss
the simulation of PRAM algorithms in external memory, which has been pro-
posed by Chiang et al. [192]. For a discussion of other simulation results see
Chapter 15.
The PRAM simulation of [192] is particularly appealing as it translates
the large number of PRAM-algorithms described in the literature into I/O-
efficient and sometimes I/O-optimal algorithms, including algorithms for a
large number of graph and geometric problems, such as connectivity, com-
puting minimum spanning trees, planarity testing and planar embedding,
computing convex hulls, Voronoi diagrams, and triangulations of point sets
in the plane.
In order to describe the simulation paradigm of [192], let us quickly re-
call the definition of a PRAM. A PRAM consists of a number of RAM-type
processing units that operate synchronously and share a global memory. Dif-
ferent processors can exchange information by reading and writing the same
memory cell in the shared memory. The two main measures of performance
of an algorithm in the PRAM-model are its running time and the amount of
work it performs. The former is defined as the maximum number of compu-
tation steps performed by any of the processors. The latter is the product of
the running time of the algorithm times the number of processors.
Now consider a PRAM-algorithm A that uses N processors and
O(N ) space and runs in O(T (N )) time. To simulate the computation of
algorithm A, assume that the processor contexts (O(1) state information per
processor) and the content of the shared memory are stored on disk in a
suitable format. Assume furthermore that every computation step of a pro-
cessor consists of a constant number of read accesses to shared memory (to
retrieve the operands of the computation step), followed by O(1) computa-
tion, and a constant number of write accesses to shared memory (to write the
results back to memory). Then one step of algorithm A can be simulated in
O(sort(N )) I/Os as follows: First scan the list of processor contexts to trans-
late the read accesses each processor intends to perform into read requests
that are written to disk. Then sort the resulting list of read requests by the
memory locations they access. Scan the sorted list of read requests and the
memory representation to augment every read request with the content of the
memory cell it addresses. Sort the list of read requests again, this time by the
issuing processor, and finally scan the sorted list of read requests and the list
of processor contexts to transfer the requested operands to each processor.
The computation of all processors can now be simulated in a single scan over
the processor contexts. Writing the results of the computation to the shared
memory can be simulated in a manner similar to the simulation of reading
the operands.
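The read phase of this simulation can be sketched as follows; this is a toy
rendering in which Python lists and sorts stand in for the disk-resident lists
and the O(sort(N ))-I/O sorting steps, and the function name is ours.

    # Simulate the read accesses of one PRAM step.
    # contexts[p]: the memory cells processor p wants to read;
    # memory: the shared-memory representation (here a list on "disk").
    def simulate_reads(contexts, memory):
        # Scan contexts: emit one read request (cell, processor) per access.
        requests = [(cell, p) for p, cells in enumerate(contexts)
                              for cell in cells]
        requests.sort()                    # sort requests by memory location
        # Scan requests and memory together (both ordered by cell) to
        # attach the contents of the addressed cells.
        answered = [(p, cell, memory[cell]) for cell, p in requests]
        answered.sort()                    # sort back by issuing processor
        # Scan to deliver the operands to the processor contexts.
        operands = [[] for _ in contexts]
        for p, cell, value in answered:
            operands[p].append((cell, value))
        return operands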
The simulation of read and write accesses to the shared memory requires
O(1) scans of the list of processor contexts, O(1) scans of the representation
of the shared memory, and sorting and scanning the lists of read and write
requests a constant number of times. Since all these lists have size O(N ),
this takes O(sort(N )) I/Os. As argued above, the computation itself can be
carried out in O(scan(N )) I/Os. Hence, a single step of algorithm A can be
simulated in O(sort(N )) I/Os, so that simulating the whole algorithm takes
O(T (N ) · sort(N )) I/Os.
Theorem 3.2. [192] A PRAM algorithm that uses N processors and
O(N ) space and runs in time T (N ) can be simulated in O(T (N ) · sort(N ))
I/Os.
Fig. 3.2. (a) The expression tree for the expression ((4 / 2) + (2 ∗ 3)) ∗ (7 − 1).
(b) The same tree with its vertices labelled with their values.
An interesting situation arises when the number of active processors and
the amount of data to be processed decrease geometrically. That is, after a
constant number of steps, the number of active processors and the amount of
processed data decrease by a constant factor. Then the data and the contexts
of the active processors can be compacted after each PRAM step, so that the
I/O-complexity of simulating the steps of the algorithm is also geometrically
decreasing. This implies that the I/O-complexity of the algorithm is domi-
nated by the complexity of simulating the first step of the algorithm, which
is O(sort(N )).
3.4 Time-Forward Processing
Time-forward processing [52, 192] is an elegant technique for solving problems
that can be expressed as a traversal of a directed acyclic graph (DAG) from its
sources to its sinks. Problems of this type arise mostly in I/O-efficient graph
algorithms, even though applications of this technique for the construction
of I/O-efficient data structures are also known. Formally, the problem that
can be solved using time-forward processing is that of evaluating a DAG G:
Let φ be an assignment of labels φ(v) to the vertices of G. Then the goal is
to compute another labelling ψ of the vertices of G so that for every vertex
v ∈ G, label ψ(v) can be computed from labels φ(v) and ψ(u1 ), . . . , ψ(uk ),
where u1 , . . . , uk are the in-neighbors of v.
As an illustration, consider the problem of expression-tree evaluation (see
Fig. 3.2). For this problem, the input is a binary tree T whose leaves store
real numbers and whose internal vertices are labelled with one of the four
elementary binary operations +, −, ∗, /. The value of a vertex is defined re-
cursively. For a leaf v, its value val (v) is the real number stored at v. For an
internal vertex v with label ◦ ∈ {+, −, ∗, /}, left child x, and right child y,
val (v) = val (x) ◦ val (y). The goal is to compute the value of the root of T .
Cast in terms of the general DAG evaluation problem defined above, tree T
is a DAG whose edges are directed from children to parents, labelling φ is the
initial assignment of real numbers to the leaves of T and of operations to the
internal vertices of T , and labelling ψ is the assignment of the values val (v)
to all vertices v ∈ T . For every vertex v ∈ T , its label ψ(v) = val (v) is com-
puted from the labels ψ(x) = val (x) and ψ(y) = val (y) of its in-neighbors
(children) and its own label φ(v) ∈ {+, −, ∗, /}.
In order to be able to evaluate a DAG G I/O-efficiently, two assumptions
have to be satisfied: (1) The vertices of G have to be stored in topologically
sorted order. That is, for every edge (v, w) ∈ G, vertex v precedes vertex w.
(2) Label ψ(v) has to be computable from labels φ(v) and ψ(u1 ), . . . , ψ(uk )
in O(sort(k)) I/Os. The second condition is trivially satisfied if every vertex
of G has in-degree no more than M .
Given these two assumptions, time-forward processing visits the vertices
of G in topologically sorted order to compute labelling ψ. Visiting the vertices
of G in this order guarantees that for every vertex v ∈ G, its in-neighbors are
evaluated before v is evaluated. Thus, if these in-neighbors “send” their labels
ψ(u1 ), . . . , ψ(uk ) to v, v has these labels and its own label φ(v) at its disposal
to compute ψ(v). After computing ψ(v), v sends its own label ψ(v) “forward
in time” to its out-neighbors, which guarantees that these out-neighbors have
ψ(v) at their disposal when it is their turn to be evaluated.
The implementation of this technique due to Arge [52] is simple and ele-
gant. The “sending” of information is realized using a priority queue Q (see
Chapter 2 for a discussion of priority queues). When a vertex v wants to send
its label ψ(v) to another vertex w, it inserts ψ(v) into priority queue Q and
gives it priority w. When vertex w is evaluated, it removes all entries with
priority w from Q. Since every in-neighbor of w sends its label to w by queu-
ing it with priority w, this provides w with the required inputs. Moreover,
every vertex removes its inputs from the priority queue before it is evaluated,
and all vertices with smaller numbers are evaluated before w. Thus, at the
time when w is evaluated, the entries in Q with priority w are those with
lowest priority, so that they can be removed using a sequence of DeleteMin
operations.
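The technique fits in a dozen lines of Python. In the sketch below an
in-memory heap stands in for the external buffer-tree priority queue, and
compute is the assumed per-vertex evaluation function; vertex identifiers and
labels are assumed to be comparable so that heap entries can be ordered.

    import heapq

    # Time-forward processing: vertices arrive in topologically sorted
    # order; psi(v) is computed from phi(v) (hidden inside `compute`)
    # and the labels sent by the in-neighbors.
    def evaluate_dag(vertices, out_edges, compute):
        Q, psi = [], {}
        for v in vertices:
            inputs = []
            while Q and Q[0][0] == v:      # DeleteMin: entries addressed to v
                inputs.append(heapq.heappop(Q)[1])
            psi[v] = compute(v, inputs)
            for w in out_edges.get(v, []): # send psi(v) "forward in time"
                heapq.heappush(Q, (w, psi[v]))  # priority = receiving vertex
        return psi

For the expression-tree example, compute would return the stored number
at a leaf and apply the vertex's operator to the received child values at an
internal vertex; for the non-commutative operators − and /, the entries would
additionally carry the sender's identity.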
Using the buffer tree of Arge [52] to implement priority queue Q, In-
sert and DeleteMin operations on Q can be performed in O((1/B)·
logM/B (|E|/B)) I/Os amortized because priority queue Q never holds more
than |E| entries. The total number of priority queue operations performed by
the algorithm is O(|E|), one Insert and one DeleteMin operation per edge.
Hence, all updates of priority queue Q can be processed in O(sort(|E|)) I/Os.
The computation of labels ψ(v) from labels φ(v) and ψ(u1 ), . . . , ψ(uk ), for
all vertices v ∈ G, can also be carried out in O(sort(|E|)) I/Os, using the
above assumption that this computation takes O(sort(k)) I/Os for a single
vertex v. Hence, we obtain the following result.
Theorem 3.3. [52, 192] Given a DAG G = (V, E) whose vertices are
stored in topologically sorted order, graph G can be evaluated in O(sort(|V | +
|E|)) I/Os, provided that the computation of the label of every vertex v ∈ G
can be carried out in O(sort(deg− (v))) I/Os, where deg− (v) is the in-degree
of vertex v.
3.5 Greedy Graph Algorithms
In this section we describe a simple technique proposed in [775] that can
be used to make internal memory graph algorithms of a sufficiently simple
structure I/O-efficient. For this technique to be applicable, the algorithm
has to compute a labelling of the vertices of the graph, and it has to do
so in a particular way. We call a vertex labelling algorithm A single-pass
if it computes the desired labelling λ of the vertices of the graph by visit-
ing every vertex exactly once and assigns label λ(v) to v during this visit.
We call A local if label λ(v) can be computed in O(sort(k)) I/Os from la-
bels λ(u1 ), . . . , λ(uk ), where u1 , . . . , uk are the neighbors of v whose labels
are computed before λ(v). Finally, algorithm A is presortable if there is an
algorithm that takes O(sort(|V | + |E|)) I/Os to compute an order of the ver-
tices of the graph so that A produces a correct result if it visits the vertices
of the graph in this order. The technique we describe here is applicable if
algorithm A is presortable, local, and single-pass.
So let A be a presortable local single-pass vertex-labelling algorithm com-
puting some labelling λ of the vertices of a graph G = (V, E). In order to
make algorithm A I/O-efficient, the two main problems are to determine
an order in which algorithm A should visit the vertices of G and devise a
mechanism that provides every vertex v with the labels of its previously vis-
ited neighbors u1 , . . . , uk . Since algorithm A is presortable, there exists an
algorithm A that takes O(sort(|V | + |E|)) I/Os to compute an order of the
vertices of G so that algorithm A produces the correct result if it visits the
vertices of G in this order. Assume w.l.o.g. that this ordering of the vertices
of G is expressed as a numbering. We use algorithm A to number the vertices
of G and then derive a DAG G from G by directing every edge of G from the
vertex with smaller number to the vertex with larger number. DAG G has
the property that for every vertex v, the in-neighbors of v in G are exactly
those neighbors of v that are labelled before v. Hence, labelling λ can be
computed using time-forward processing. In particular, by the locality of A,
the label λ(v) of every vertex can be computed in O(sort(k)) I/Os from the
labels λ(u1 ), . . . , λ(uk ) of its in-neighbors u1 , . . . , uk in G , which is a simpli-
fied version of the condition for the applicability of time-forward processing.
This leads to the following result.
Theorem 3.4. [775] Every graph problem P that can be solved by a pre-
sortable local single-pass vertex labelling algorithm can be solved in O(sort(|V |+
|E|)) I/Os.
An important observation to be made is that in this application of time-
forward processing, the restriction that the vertices of the DAG to be evalu-
ated have to be given in topologically sorted order does not pose a problem
because the directions of the edges are chosen only after fixing an order of
the vertices that is to be the topological order.
To illustrate the power of Theorem 3.4, we apply it below to obtain deter-
ministic O(sort(|V |+|E|)) I/O algorithms for finding a maximal independent
set of a graph G and coloring a graph of degree ∆ with ∆ + 1 colors. In [775],
the approach is applied in a less obvious manner in order to compute a maxi-
mal matching of a graph G = (V, E) in O(sort(|V | + |E|)) I/Os. The problem
with computing a maximal matching is that it is an edge labelling problem.
However, Zeh [775] shows that it can be transformed into a vertex labelling
problem of a graph with |E| vertices and at most 2|E| edges.
3.5.1 Computing a Maximal Independent Set
In order to compute a maximal independent set S of a graph G = (V, E)
in internal memory, the following simple algorithm can be used: Process the
vertices in an arbitrary order. When a vertex v ∈ V is visited, add it to S if
none of its neighbors is in S. Translated into a labelling problem, the goal is
to compute the characteristic function χS : V → {0, 1} of S, where χS (v) = 1
if v ∈ S, and χS (v) = 0 if v ∉ S. Also note that if S is initially empty, then
any neighbor w of v that is visited after v cannot be in S at the time when
v is visited, so that it is sufficient for v to inspect all its neighbors that are
visited before v to decide whether or not v should be added to S. The result of
these modifications is a vertex-labelling algorithm that is presortable (since
the order in which the vertices are visited is unimportant), local (since only
previously visited neighbors of v are inspected to decide whether v should be
added to S, and a single scan of labels χS (u1 ), . . . , χS (uk ) suffices to do so),
and single-pass. This leads to the following result.
Theorem 3.5. [775] Given an undirected graph G = (V, E), a maximal in-
dependent set of G can be found in O(sort(|V | + |E|)) I/Os and linear space.
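In internal memory the labelling process is just a few lines; in the external
setting, the loop over the previously visited neighbors is exactly what time-
forward processing provides. The function name in this Python sketch is ours.

    # Greedy maximal independent set as a vertex-labelling process.
    # adj[v] lists the neighbors of v; any visiting order works.
    def maximal_independent_set(vertices, adj):
        chi = {}                           # chi[v] = 1 iff v is placed in S
        for v in vertices:
            earlier = [u for u in adj[v] if u in chi]   # visited before v
            # Later-visited neighbors cannot be in S yet, so inspecting
            # the earlier ones suffices.
            chi[v] = 0 if any(chi[u] for u in earlier) else 1
        return {v for v in vertices if chi[v]}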
3.5.2 Coloring Graphs of Bounded Degree
The algorithm to compute a (∆ + 1)-coloring of a graph G whose vertices
have degree bounded by some constant ∆ is similar to the algorithm for com-
puting a maximal independent set presented in the previous section: Process
the vertices in an arbitrary order. When a vertex v ∈ V is visited, assign a
color c(v) ∈ {1, . . . , ∆+1} to vertex v that has not been assigned to any neigh-
bor of v. The algorithm is presortable and single-pass for the same reasons as
the maximal independent set algorithm. The algorithm is local because the
color of v can be determined as follows: Sort the colors c(u1 ), . . . , c(uk ) of v’s
in-neighbors u1 , . . . , uk . Then scan this list and assign the first color not in
this list to v. This takes O(sort(k)) I/Os.
Theorem 3.6. [775] Given an undirected graph G = (V, E) whose ver-
tices have degree at most ∆, a (∆ + 1)-coloring of G can be computed in
O(sort(|V | + |E|)) I/Os and linear space.
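The local step, finding the first color missing from the sorted list of the
previously assigned neighbor colors, can be sketched as follows (the function
name is ours):

    # First color in {1, ..., Delta + 1} not used by the previously
    # visited neighbors; neighbor_colors may contain duplicates.
    def first_free_color(neighbor_colors):
        c = 1
        for u in sorted(set(neighbor_colors)):   # "sort, then scan"
            if u == c:
                c += 1
            elif u > c:
                break
        return c

    # With at most Delta neighbors, at most Delta colors are excluded,
    # so the returned color is at most Delta + 1.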
3.6 List Ranking and the Euler Tour Technique
List ranking and the Euler tour technique are two techniques that have been
applied successfully in the design of PRAM algorithms for labelling problems
on lists and rooted trees and problems that can be reduced efficiently to
one of these problems. Given the similarity of the issues to be addressed in
parallel and external memory algorithms, it is not surprising that the same
two techniques can be applied in I/O-efficient algorithms as well.
3.6.1 List Ranking
Let L be a linked list, i.e., a collection of vertices x1 , . . . , xN such that each
vertex xi , except the tail of the list, stores a pointer succ(xi ) to its successor
in L, no two vertices have the same successor, and every vertex can reach the
tail of L by following successor pointers. Given a pointer to the head of the
list (i.e., the vertex that no other vertex in the list points to), the list ranking
problem is that of computing for every vertex xi of list L, its distance from
the head of L, i.e., the number of edges on the path from the head of L to xi .
In internal memory this problem can easily be solved in linear time us-
ing the following algorithm: Starting at the head of the list, follow succes-
sor pointers and number the vertices of the list from 0 to N − 1 in the
order they are visited. Often we use the term “list ranking” to denote the
following generalization of the list ranking problem, which is solvable in
linear time using a straightforward generalization of the above algorithm:
Given a function λ : {x_1 , . . . , x_N } → X assigning labels to the vertices
of list L and a multiplication ⊗ : X × X → X defined over X, compute
a label φ(x_i ) for each vertex x_i of L such that φ(x_{σ(1)} ) = λ(x_{σ(1)} ) and
φ(x_{σ(i)} ) = φ(x_{σ(i−1)} ) ⊗ λ(x_{σ(i)} ), for 1 < i ≤ N , where σ : [1, N ] → [1, N ] is
a permutation so that x_{σ(1)} is the head of L and succ(x_{σ(i)} ) = x_{σ(i+1)} , for
1 ≤ i < N .
Unfortunately the simple internal memory algorithm is not I/O-efficient:
Since we have no control over the physical order of the vertices of L on disk,
an adversary can easily arrange the vertices of L in a manner that forces the
internal memory algorithm to perform one I/O per visited vertex, so that
the algorithm performs Ω(N ) I/Os in total. On the other hand, the lower
bound for list ranking shown in [192] is only Ω(perm(N )). Next we sketch
a list ranking algorithm proposed in [192] that takes O(sort(N )) I/Os and
thereby closes the gap between the lower and the upper bound.
We make the simplifying assumption that multiplication over X is as-
sociative. If this is not the case, we determine the distance of every vertex
from the head of L, sort the vertices of L by increasing distances, and then
compute the prefix product using the internal memory algorithm. After ar-
ranging the vertices by increasing distances from the head of L, the internal
memory algorithm takes O(scan(N )) I/Os. Hence, the whole procedure still
takes O(sort(N )) I/Os, and the associativity assumption is not a restriction.
Given that multiplication over X is associative, the algorithm of [192]
uses graph contraction to rank list L as follows: First an independent set I
of L is found so that |I| = Ω(N ). Then the elements in I are removed from L.
That is, for every element x ∈ I with predecessor y and successor z in L, the
successor pointer of y is updated to succ(y) = z. The label of x is multiplied
with the label of z, and the result is assigned to z as its new label in the
compressed list. It is not hard to see that the weighted ranks of the elements
in L−I remain the same after adjusting the labels in this manner. Hence, their
ranks can be computed by applying the list ranking algorithm recursively to
the compressed list. Once the ranks of all elements in L − I are known, the
ranks of the elements in I are computed by multiplying their labels with the
ranks of their predecessors in L.
If the algorithm excluding the recursive invocation on the compressed list
takes O(sort(N )) I/Os, the total I/O-complexity of the algorithm is given by
the recurrence I(N ) = I(cN )+O(sort(N )), for some constant 0 < c < 1. The
solution of this recurrence is O(sort(N )). Hence, we have to argue that every
step, except the recursive invocation, can be carried out in O(sort(N )) I/Os.
Given independent set I, it suffices to sort the vertices in I by their suc-
cessors and the vertices in L − I by their own IDs, and then scan the resulting
two sorted lists to update the weights of the successors of all elements in I.
The successor pointers of the predecessors of all elements in I can be updated
in the same manner. In particular, it suffices to sort the vertices in L − I by
their successors and the vertices in I by their own IDs, and then scan the two
sorted lists to copy the successor pointer from each vertex in I to its prede-
cessor. Thus, the construction of the compressed list takes O(sort(N )) I/Os,
once set I is given.
In order to compute the independent set I, Chiang et al. [192] apply
a 3-coloring procedure for lists, which applies time-forward processing to
“monotone” sublists of L and takes O(sort(N )) I/Os; the largest monochro-
matic set is chosen to be set I. Using the maximal independent set algorithm
of Section 3.5.1, a large independent set I can be obtained more directly in
the same number of I/Os because a maximal independent set of a list has
size at least N/3. Thus, we have the following result.
Theorem 3.7. [192] A list of length N can be ranked in O(sort(N )) I/Os.
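One contraction round of this algorithm can be sketched in Python as follows.
Here combine is the assumed associative multiplication ⊗ (addition by
default, which yields the plain distance ranks), and the independent set I is
assumed to contain no two adjacent list elements; the function name is ours.

    # Splice the independent set I out of the list, folding the label of
    # each removed element x into the label of its successor z.
    def contract(succ, label, I, combine=lambda a, b: a + b):
        pred = {s: x for x, s in succ.items() if s is not None}
        for x in I:
            z = succ[x]
            if z is not None:
                label[z] = combine(label[x], label[z])   # new label of z
            y = pred.get(x)
            if y is not None:
                succ[y] = z                # bypass x: succ(y) = z
            del succ[x]                    # x leaves the compressed list
        # label[x] is kept for each x in I: its rank is later obtained
        # from the rank of its predecessor in L.
        return succ, label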
List ranking alone is of very limited use. However, combined with the
Euler tour technique described in the next section, it becomes a very powerful
tool for solving problems on trees that can be expressed as functions over a
traversal of the tree or problems on general graphs that can be expressed in
terms of a traversal of a spanning tree of the graph. An important application
is the rooting of an undirected tree T , which is the process of directing all
edges of T from parents to children after choosing one vertex of T as the root.
Given a rooted tree T (i.e., one where all edges are directed from parents to
children), the Euler tour technique and list ranking can be used to compute
a preorder or postorder numbering of the vertices of T , or the sizes of the
subtrees rooted at the vertices of T . Such labellings are used in many classical
graph algorithms, so that the ability to compute them is a first step towards
solving more complicated graph problems.
3.6.2 The Euler Tour Technique
An Euler tour of a tree T = (V, E) is a traversal of T that traverses every
edge exactly twice, once in each direction. Such a traversal is useful, as it
produces a linear list of vertices or edges that captures the structure of the
tree. Hence, it allows standard parallel or external memory algorithms to be
applied to this list, in order to solve problems on tree T that can be expressed
as some function to be evaluated over the Euler tour.
Formally, the tour is represented as a linked list L whose elements are
the edges in the set {(v, w), (w, v) : {v, w} ∈ E} and so that for any two
consecutive edges e1 and e2 , the target of e1 is the source of e2 . In order to
define an Euler tour, choose a circular order of the edges incident to each
vertex of T . Let {v, w1 }, . . . , {v, wk } be the edges incident to vertex v. Then
let succ((wi , v)) = (v, wi+1 ), for 1 ≤ i < k, and succ((wk , v)) = (v, w1 ). The
result is a circular linked list of the edges in T . Now an Euler tour of T
starting at some vertex r and returning to that vertex can be obtained by
choosing an edge (v, r) with succ((v, r)) = (r, w), setting succ((v, r)) = null,
and choosing (r, w) as the first edge of the traversal.
List L can be computed from the edge set of T in O(sort(N )) I/Os:
First scan set E to replace every edge {v, w} with two directed edges (v, w)
and (w, v). Then sort the resulting set of directed edges by their target ver-
tices. This stores the incoming edges of every vertex consecutively. Hence, a
scan of the sorted edge list now suffices to compute the successor of every
edge in L.
Lemma 3.8. An Euler tour L of a tree with N vertices can be computed in
O(sort(N )) I/Os.
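A Python sketch of this construction follows; a dictionary of neighbor lists
replaces the sorting steps, and the circular order at each vertex is taken to
be the sorted order. The function name is ours.

    from collections import defaultdict

    # Build the circular Euler-tour successor map of a tree, given its
    # undirected edge set.
    def euler_tour(edges):
        nbrs = defaultdict(list)
        for v, w in edges:
            nbrs[v].append(w)
            nbrs[w].append(v)
        succ = {}
        for v, ws in nbrs.items():
            ws.sort()                      # any fixed circular order works
            k = len(ws)
            for i in range(k):
                # incoming edge (w_i, v) continues as outgoing (v, w_{i+1})
                succ[(ws[i], v)] = (v, ws[(i + 1) % k])
        return succ

    # Breaking the circular list before one outgoing edge (r, w) of the
    # chosen root r yields an Euler tour starting and ending at r.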
Given an unrooted (and undirected) tree T , choosing one vertex of T as
the root defines a direction on the edges of T by requiring that every edge be
directed from the parent to the child. The process of rooting tree T is that of
computing these directions explicitly for all edges of T . To do this, we con-
struct an Euler tour starting at an edge (r, v) and compute the rank of every
edge in the list. For every pair of opposite edges (u, v) and (v, u), we call the
edge with the lower rank a forward edge, and the other a back edge. Now it suf-
fices to observe that for any vertex x ≠ r in T , edge (parent (x), x) is traversed
before edge (x, parent (x)) by any Euler tour starting at r. Hence, for every
pair of adjacent vertices x and parent (x), edge (parent (x), x) is a forward
edge, and edge (x, parent (x)) is a back edge. That is, the set of forward edges
is the desired set of edges directed from parents to children. Constructing
and ranking an Euler tour starting at the root r takes O(sort(N )) I/Os, by
Theorem 3.7 and Lemma 3.8. Given the ranks of all edges, the set of forward
edges can be extracted by sorting all edges in L so that for any two adjacent
vertices v and w, edges (v, w) and (w, v) are stored consecutively and then
scanning this sorted edge list to discard the edge with higher rank from each
of these edge pairs. Hence, a tree T can be rooted in O(sort(N )) I/Os.
Instead of discarding back edges, it may be useful to keep them, but
tag every edge of the Euler tour L as either a forward or back edge. Using
this information, well-known labellings of the vertices of T can be computed
by ranking list L after assigning appropriate weights to the edges of L. For
example, consider the weighted ranks of the edges in L after assigning weight
one to every forward edge and weight zero to every back edge. Then the
preorder number of every vertex v ≠ r in T is one more than the weighted
rank of the forward edge with target v; the preorder number of the root r is
always one. The size of the subtree rooted at v is one more than the difference
between the weighted ranks of the back edge with source v and the forward
edge with target v. To compute a postorder numbering, we assign weight zero
to every forward edge and weight one to every back edge. Then the postorder
number of every vertex v ≠ r is the weighted rank of the back edge with
source v. The postorder number of the root r is always N .
After labelling every edge in L as a forward or back edge, the appropriate
weights for computing the above labellings can be assigned in a single scan
of list L. The weighted ranks can then be computed in O(sort(N )) I/Os, by
Theorem 3.7. Extracting preorder and postorder numbers from these ranks
takes a single scan of list L again. To extract the sizes of the subtrees rooted
at the vertices of T , we sort the edges in L so that opposite edges with the
same endpoints are stored consecutively. Then a single scan of this sorted
edge list suffices to compute the size of the subtree rooted at every vertex v.
Hence, all these labels can be computed in O(sort(N )) I/Os for a tree with
N vertices.
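As a sanity check of these formulas, the following sketch derives the preorder
numbers and subtree sizes from the weighted ranks of a tour; the plain walk
over the successor map stands in for list ranking, and vertex identifiers
0, . . . , n − 1 are an assumption made for brevity:

    def labels_from_tour(succ, first_edge, root, n):
        # Weighted ranks: forward edges (first traversal of a pair of
        # opposite edges) weigh one, back edges weigh zero.
        rank, seen, e, r = {}, set(), first_edge, 0
        while e is not None:
            forward = frozenset(e) not in seen
            seen.add(frozenset(e))
            if forward:
                r += 1
            rank[e] = (r, forward)
            e = succ.get(e)              # the tour ends at the broken link
        pre = {root: 1}                  # the root's preorder number is one
        size = {v: 1 for v in range(n)}
        size[root] = n                   # the root's subtree is the whole tree
        for (u, v), (rv, forward) in rank.items():
            if forward:
                pre[v] = rv + 1                      # forward-edge rank + 1
                size[v] = rank[(v, u)][0] - rv + 1   # back minus forward + 1
        return pre, size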
3.7 Graph Blocking
The final topic we consider is that of blocking graphs. In particular, we are
interested in laying out graphs on disk so that traversals of paths in these
graphs cause as few page faults as possible, using an appropriate paging
algorithm that needs to be specified along with the graph layout. We make
two assumptions, the first of which makes the design of suitable layouts easier,
while the second makes it harder. The first assumption we make is that the
graph to be stored on disk is static, i.e., does not change. Hence, it is not
necessary to be able to update the graph I/O-efficiently, so that redundancy
can be used to obtain better layouts. That is, some or all of the vertices of the
graph may be stored in more than one location on disk. In order to visit such a
vertex v, it suffices to visit any of these copies. This gives the paging algorithm
considerable freedom in choosing the “best” of all disk blocks containing a
copy of v as the one to be brought into main memory. By choosing the right
block, the paging algorithm can guarantee that the next several steps along
the path do not cause any page faults. The second assumption we make is
that the paths are traversed in an online fashion. That is, the traversed path
is constructed only while it is traversed and is not known in advance. This
allows an adversary to choose the worst possible path based on the previous
decisions made by the paging algorithm, and the paging algorithm has to
be able to handle such adversarial behavior gracefully. That is, it has to
minimize the number of page faults in the worst case without having any a
priori knowledge about the traversed path.
Besides the obvious applications where the problem to be solved is a graph
problem and the answer to a query consists of a traversal of a path in the
graph, the theory of graph blocking can be applied in order to store pointer-
based data structures on disk so that queries on these data structures can
be answered I/O-efficiently. In particular, a pointer-based data structure can
be viewed as a graph with additional information attached to its vertices.
Answering a query on such a data structure often reduces to traversing a
path starting at a specified vertex in the data structure. For example, laying
out binary search trees on disk so that paths can be traversed I/O-efficiently,
one arrives at a layout which bears considerable resemblance to a B-tree.
As discussed in Chapter 2, the layout of lists discussed below needs only a few
modifications to be transformed into an I/O-efficient linked list data structure
that allows insertions and deletions, i.e., is no longer static.
Since we allow redundant representations of the graph, the two main
measures of performance for a given blocking and the used paging algorithm
are the number of page faults incurred by a path traversal in the worst case
and the amount of space used by the graph representation. Clearly, in order
to store a graph with N vertices on disk, at least N/B blocks of storage are
required. We define the storage blow-up of a graph blocking to be β if it uses
βN/B blocks of storage to store the graph on disk. Since space usage is a
serious issue with large data sets, the goal is to design graph blockings that
minimize the storage blow-up and at the same time minimize the number of
page faults incurred by a path traversal. Often there is a trade-off. That is, no
blocking manages to minimize both performance measures at the same time.
In this section we restrict our attention to graph layouts with constant storage
blow-up and bound the worst-case number of page faults achievable by these
layouts using an appropriate paging algorithm. Throughout this section we
denote the length of the traversed path by L. The traversal of such a path
requires at least L/B I/Os in any graph because at most B vertices can be
brought into main memory in a single I/O-operation.
The graphs we consider include lists, trees, grids and planar graphs. The
blocking for planar graphs generalizes to any class of graphs with small sep-
arators. The results presented here are described in detail in the papers of
Nodine et al. [585], Hutchinson et al. [419], and Agarwal et al. [7].
Blocking Lists. The natural approach for blocking a list is to store the
vertices of the list in an array, sorted in their order of appearance along the
list. The storage blow-up of this blocking is one (i.e., there is no blow-up
at all). Since every vertex is stored exactly once in the array, the paging
algorithm has no choice about the block to be brought into main memory
when a vertex is visited. Still, if the traversed path is simple (i.e., travels
along the list in only one direction), the traversal of a path of length L incurs
only L/B page faults. To see this, assume w.l.o.g. that the path traverses
the list in forward direction, i.e., the vertices are visited in the same order as
they are stored in the array, and consider a vertex v in the path that causes
a page fault. Then v is the first vertex in the block that is brought into main
memory, and the B − 1 vertices succeeding v in the direction of the traversal
are stored in the same block. Hence, the traversal of any simple path causes
one page fault every B steps along the path.
If the traversed path is not simple, there are several alternatives. Assuming
that M ≥ 2B, the same layout as for simple paths can be used; but the
paging algorithm has to be changed somewhat. In particular, when a page
fault occurs at a vertex v, the paging algorithm has to make sure that the
block brought into main memory does not replace the block containing the
vertex u visited just before v. Using this strategy, it is again guaranteed that
after every page fault, at least B − 1 steps are required before the next page
fault occurs. Indeed, the block containing vertex v contains all vertices that
can be reached from v in B − 1 steps by continuing the traversal in the same
direction, and the block containing vertex u contains all vertices that can
be reached from v in B steps by continuing the traversal in the opposite
direction. Hence, traversing a path of length L incurs at most L/B page
faults.
In the pathological situation that M = B (i.e., there is room for only one
block in main memory) and given the layout described above, an adversary
can construct a path whose traversal causes a page fault at every step. In
particular, the adversary chooses two adjacent vertices v and w that are in
Fig. 3.3. A layout of a list on a disk with block size B = 4. The storage blow-up
of the layout is two.
different blocks and then constructs the path P = (v, w, v, w, . . . ). Whenever
vertex v is visited, the block containing v is brought into main memory,
thereby overwriting the block containing w. When vertex w is visited, the
whole process is reversed. The following layout with storage blow-up two
thwarts the adversary’s strategy: Instead of having only one array containing
the vertices of the list, create a second array storing the vertices of the list in
the same order, but with the block boundaries offset by B/2 (see Fig. 3.3).
To use this layout efficiently, the paging algorithm has to change as follows:
Assume that the current block is from the first array. When a page fault
occurs, the vertex v to be visited is the last vertex in the block preceding
the current block in the first array or the first vertex in the block succeeding
the current block. Since the blocks in the second array are offset by B/2,
this implies that v is at least B/2 − 1 steps away from the border of the
block containing v in the second array. Hence, the paging algorithm loads
this block into memory because then at least B/2 steps are required before
the next page fault occurs. When the next page fault occurs, the algorithm
switches back to a block in the first array, which again guarantees that the
next B/2 − 1 steps cannot cause a page fault. That is, the paging algorithm
alternates between the two arrays. Traversing a path of length L now incurs
at most 2L/B page faults.
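This alternating strategy can be simulated in a few lines; the class below
is a toy model with names of our choosing, in which visit reports whether
stepping to position pos faults and, on a fault, loads the block, from either
array, whose boundaries are farthest from pos:

    class ListPager:
        # Toy model of the two-array layout: blocks of array 0 start at
        # multiples of B, blocks of array 1 are offset by B // 2.
        def __init__(self, B):
            self.B, self.block = B, None    # block = (array, first position)

        def _block_of(self, pos, array):
            off = 0 if array == 0 else self.B // 2
            return (array, ((pos - off) // self.B) * self.B + off)

        def visit(self, pos):
            # Returns True iff visiting position pos causes a page fault.
            if self.block is not None:
                _, start = self.block
                if start <= pos < start + self.B:
                    return False
            # On a fault at a block border of one array, the chosen block
            # of the other array leaves at least B/2 - 1 fault-free steps.
            cands = [self._block_of(pos, 0), self._block_of(pos, 1)]
            self.block = max(cands, key=lambda b: min(pos - b[1],
                                                      b[1] + self.B - 1 - pos))
            return True

On the adversarial path (v, w, v, w, . . . ) from above, this pager faults at
most once every B/2 steps, in line with the 2L/B bound.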
Blocking Trees. Next we discuss blocking of trees. Quite surprisingly, trees
cannot be blocked at all if there are no additional restrictions on the tree or
the type of traversal that is allowed. To see this, consider a tree whose internal
vertices have degree at least M . Then for any vertex v, at most M − 1 of its
neighbors can reside in main memory at the same time as v. Hence, there is
at least one neighbor of v that is not in main memory at the time when v is
in main memory. An adversary would always select this vertex as the one to
be visited after v. Since at least every other vertex on any path in the tree
has to be an internal vertex, the adversary can construct a path that causes a
page fault every other step along the path. Note that this is true irrespective
of the storage blow-up of the graph representation.
From this simple example it follows that for unrestricted traversals, a good
blocking of a tree can be achieved only if the degree of the vertices of the tree
is bounded by some constant d. We show that there is a blocking with storage
blow-up four so that traversing a path of length L causes at most 2L/log_d B
page faults. To construct this layout, which is very similar to the list layout
Fig. 3.4. A blocking of a binary tree with block size 7. The subtrees in the first
partition are outlined with dashed lines. The subtrees in the second partition are
outlined with solid lines.
shown in Fig. 3.3, we choose one vertex r of T as the root and construct two
partitions of T into layers of height log_d B (see Fig. 3.4). In the first partition,
the i-th layer contains all vertices at distance between (i − 1) log_d B and
i · log_d B − 1 from r. In the second partition, the i-th layer contains all vertices
at distance between (i − 1/2) log_d B and (i + 1/2) log_d B − 1 from r. Each
layer in both partitions consists of subtrees of size at most B, so that each
subtree can be stored in a block. Moreover, small subtrees can be packed into
blocks so that no block is less than half full. Hence, both partitions together
use at most 4N/B blocks, and the storage blow-up is at most four.
The paging algorithm now alternates between the two partitions, similarly to
the above paging algorithm for lists. Consider the traversal of a path, and let
v be a vertex that causes a page fault. Assume that the tree currently held in
main memory is from the first partition. Then v is the root or a leaf of a tree
in the first partition. Hence, the tree in the second partition that contains v
contains all vertices that can be reached from v in (log_d B)/2 − 1 steps. Thus,
by loading this block into main memory, the algorithm guarantees that the
next page fault occurs after at least (log_d B)/2 − 1 steps, and traversing a
path of length L causes at most 2L/(log_d B) page faults.
If all traversed paths are restricted to travel away from the root of T ,
the storage blow-up can be reduced to two, and the number of page faults
can be reduced to L/log_d B. To see this, observe that only the first of the
above partitions is needed, and for any traversed path, the vertices causing
page faults are the roots of subtrees in the partition. After loading the block
containing that root into main memory, log_d B − 1 steps are necessary in
order to reach a leaf of the subtree, and the next page fault occurs after
log_d B steps. For traversals towards the root, Hutchinson et al. [419] show
that using O(N/B) disk blocks, a page fault occurs every Ω(B) steps, so that
a path of length L can be traversed in O(L/B) I/Os.
Blocking Two-Dimensional Grids. Blocking of two-dimensional grids
can be done using the same ideas as for blocking lists. This is not too sur-
prising because lists are equivalent to one-dimensional grids from a blocking
point of view. In both cases, the grid is covered with subgrids of size B. In
the two-dimensional case, the subgrids have dimension √B × √B. We call
such a covering a tessellation.
However, the added dimension does create a few complications. In par-
ticular, if the amount of main memory is M = B, a blocking that consists
of three tessellations is required to guarantee that a page fault occurs only
every ω(1) steps. To see that two tessellations are not sufficient, consider two
tessellations offset by k and l in the x- and y-dimensions. Then an adversary
chooses a path containing the vertices (i√B + k, j√B), (i√B + k + 1, j√B),
(i√B + k, j√B + 1), and (i√B + k + 1, j√B + 1), for two integers i and j.
Neither of the two tessellations contains a subgrid that contains more than
two of these vertices. Hence, given that the traversal is at one of the four
vertices, and only one of the two subgrids containing this vertex is in main
memory, an adversary can always choose the neighbor of the current vertex
that does not belong to the subgrid in main memory. Thus, every step causes
a page fault.
By increasing the storage blow-up to three, it can be guaranteed that
a page fault occurs at most once every √B/6 steps. In particular, the blocking
consists of three tessellations so that the second tessellation has offset √B/3
in both directions w.r.t. the first tessellation, and the third tessellation has
offset 2√B/3 w.r.t. the first tessellation (see Fig. 3.5). Then it is not hard
to see that for every vertex in the grid, there exists at least one subgrid in
one of the three tessellations, so that the vertex is at least √B/6 steps away
from the boundary of the subgrid. Hence, whenever a page fault occurs at
some vertex v, the paging algorithm brings the appropriate subgrid for vertex v
into main memory.
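A paging algorithm can locate the best of the three tessellations for a faulting
vertex (x, y) as in the sketch below; s stands for √B, the offsets 0, s/3 and
2s/3 are those of the construction, and the function name is ours:

    import math

    def best_subgrid(x, y, B):
        # Among the three tessellations, pick the subgrid that keeps
        # (x, y) farthest from a boundary; the offset argument above
        # guarantees a distance of at least sqrt(B)/6 for one of them.
        s = math.isqrt(B)
        best = None
        for t in range(3):
            off = t * s // 3
            bx, by = (x - off) // s, (y - off) // s     # subgrid indices
            dx = min((x - off) - bx * s, bx * s + s - 1 - (x - off))
            dy = min((y - off) - by * s, by * s + s - 1 - (y - off))
            d = min(dx, dy)                 # steps to the subgrid boundary
            if best is None or d > best[0]:
                best = (d, t, (bx, by))
        return best    # (guaranteed fault-free steps, tessellation, subgrid)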
If M ≥ 2B, the storage blow-up can be reduced to two, and the number
of steps between two page faults can be increased to √B/4. In particular, the
blocking consists of two tessellations so that the second tessellation has offset
√B/2 in both directions w.r.t. the first tessellation. Let v be a vertex that
causes a page fault, and let u be the vertex visited before v in the traversal.
Then at the time when the paging algorithm has to bring vertex v into main
memory, a block containing u from one of the two tessellations is already in
main memory. Now it is easy to verify that the block containing u that is
already in main memory together with one of the two blocks containing v
in the two tessellations covers at least the √B/4-neighborhood of v, i.e.,
the subgrid containing all vertices that can be reached from v in at most
√B/4 steps. Hence, by bringing the appropriate block containing v into main
memory, the paging algorithm can guarantee that the next page fault occurs
only after √B/4 steps.
Fig. 3.5. A blocking for M = B.
Finally, if M ≥ 3B, the storage blow-up can be brought down to one
while keeping the number of page faults incurred by the traversal of a path
of length L at 4L/√B. To achieve this, the tessellation shown in Fig. 3.6 is
used. To prove that the traversal of a path of length L incurs at most 4L/√B
page faults, we show that at most two page faults can occur within √B/2
steps. So assume the opposite. Then let v be a vertex that causes a page fault,
and let u be the vertex visited immediately before v. Let u be in the solid bold
subgrid in Fig. 3.6 and assume that it is in the top left quarter of the subgrid.
Then all vertices that can be reached from u in √B/2 steps are contained in
the solid thin subgrids. In particular, v is contained in one of these subgrids.
If v is in subgrid A, it takes at least √B/2 steps after visiting v to reach a
vertex in subgrid C. If v is in subgrid C, it takes at least √B/2 steps after
visiting v to reach a vertex in subgrid A. Hence, in both cases only a vertex
in subgrid B can cause another page fault within √B/2 steps after visiting
vertex u. If v is in subgrid B, consider the next vertex w after v that causes
a page fault and is at most √B/2 steps away from u. Vertex w is either in
subgrid A, or in subgrid C. W.l.o.g. let w be in subgrid A. Then it takes
at least √B/2 steps to reach a vertex in subgrid C, so that again only two
page faults can occur within √B/2 steps after visiting u. This shows that the
traversal of a path of length L incurs at most 4L/√B page faults.
Blocking Planar Graphs. The final result we discuss here concerns the
blocking of planar graphs of bounded degree. Quite surprisingly, planar graphs
allow blockings with the same performance as for trees, up to constant factors.
That is, with constant storage blow-up it can be guaranteed that traversing
a path of length L incurs at most 4L/log_d B page faults, where d is the
maximal degree of the vertices in the graph. To achieve this, Agarwal et
al. [7] make use of separator results due to Frederickson [315]. In particular,
Fig. 3.6. A blocking with storage blow-up one.
Frederickson shows that for every planar graph G, there exists a set S of
O(N/√B) vertices so that no connected component of G − S has size more
than B. Based on this result, the following graph representation can be used
to achieve the above result. First ensure that every connected component
of G − S is stored in a single block and pack small connected components
into blocks so that every block is at least half full. This representation of
G − S uses at most 2N/B disk blocks. The second part of the blocking
consists of the (log_d B)/2-neighborhoods of the vertices in S. That is, for
every vertex v ∈ S, the vertices reachable from v in at most (log_d B)/2 steps
are stored in a single block. These vertices fit into a single block because at
most d^((log_d B)/2) = √B vertices can be reached in that many steps from v.
Packing these neighborhoods into blocks so that every block is at least half
full, this second part of the blocking uses O(√B · |S|/B) = O(N/B) blocks.
Hence, the storage blow-up is O(1).
Now consider the exploration of an arbitrary path in G. Let v be a vertex
that causes a page fault. If v ∈ S, the paging algorithm brings the block con-
taining the (log_d B)/2-neighborhood of v into main memory. This guarantees
that at least (log_d B)/2 steps along the path are required before the next
page fault occurs. If v ∉ S, then v ∈ G − S, and the paging algorithm brings
the block containing the connected component of G − S that contains v into
main memory. As long as the path stays inside this connected component, no
further page faults occur. When the next page fault occurs, it has to happen
at a vertex w ∈ S. Hence, the paging algorithm brings the block containing
the neighborhood of w into main memory, and at least (log_d B)/2 steps are
required before the next page fault occurs. Thus, at most two page faults oc-
cur every (log_d B)/2 steps, and traversing a path of length L incurs at most
4L/(log_d B) page faults. This is summarized in the following theorem.
Theorem 3.9 (Agarwal et al. [7]). A planar graph with N vertices of
degree at most d can be stored in O(N/B) blocks so that any path of length
L can be traversed in O(L/log_d B) I/Os.
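The paging rule used in the proof of Theorem 3.9 is simple enough to state as
code; in the sketch below, the precomputed maps in_S, nbhd_block and
comp_block are hypothetical names for the two parts of the blocking:

    def planar_on_fault(v, in_S, nbhd_block, comp_block):
        # Separator vertices load their (log_d B)/2-neighborhood block;
        # all other vertices load the block storing their connected
        # component of G - S.
        if in_S[v]:
            return nbhd_block[v]    # >= (log_d B)/2 fault-free steps follow
        return comp_block[v]        # fault-free until the path leaves the
                                    # component, which can only happen at S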
3.8 Remarks
In this chapter we have seen some of the fundamental techniques used in
the design of efficient external memory algorithms. There are many other
techniques that are not discussed in this chapter, but are scattered across
various other chapters in this volume. I/O-efficient data structures are dis-
cussed in Chapter 2. Techniques used in computational geometry, including
distribution sweeping and batch filtering, are discussed in Chapter 6. Many
specialized graph algorithms are discussed in Chapters 4 and 5. These include
algorithms for connectivity problems, breadth-first search, depth-first search,
shortest paths, partitioning planar graphs, and computing planar embed-
dings of planar graphs. The simulation of bulk synchronous (coarse grained)
parallel algorithms is discussed in Chapter 15.
4. Elementary Graph Algorithms in External
Memory∗
Irit Katriel and Ulrich Meyer∗∗
4.1 Introduction
Solving real-world optimization problems frequently boils down to process-
ing graphs. The graphs themselves are used to represent and structure rela-
tionships of the problem’s components. In this chapter we review external-
memory (EM) graph algorithms for a few representative problems:
Shortest path problems are among the most fundamental and also the
most commonly encountered graph problems, both in themselves and as sub-
problems in more complex settings [21]. Besides obvious applications like
preparing travel time and distance charts [337], shortest path computations
are frequently needed in telecommunications and transportation industries
[677], where messages or vehicles must be sent between two geographical lo-
cations as quickly or as cheaply as possible. Other examples are complex
traffic flow simulations and planning tools [337], which rely on solving a large
number of individual shortest path problems. One of the most commonly
encountered subtypes is the Single-Source Shortest-Path (SSSP) version: let
G = (V, E) be a graph with |V | nodes and |E| edges, let s be a distinguished
vertex of the graph, and c be a function assigning a non-negative real weight
to each edge of G. The objective of the SSSP is to compute, for each vertex v
reachable from s, the weight dist(v) of a minimum-weight (“shortest”) path
from s to v; the weight of a path is the sum of the weights of its edges.
Breadth-First Search (BFS) [554] can be seen as the unweighted version of
SSSP; it decomposes a graph into levels where level i comprises all nodes that
can be reached from the source via i edges. The BFS numbers also impose an
order on the nodes within the levels. BFS has been widely used since the late
1950’s; for example, it is an ingredient of the classical separator algorithm
for planar graphs [507].
Another basic graph-traversal approach is Depth-First Search (DFS) [407];
instead of exploring the graph in levels, DFS tries to visit as many graph
vertices as possible in a long, deep path. When no edge to an unvisited node
∗ Partially supported by the Future and Emerging Technologies programme of
the EU under contract number IST-1999-14186 (ALCOM-FT) and by the DFG
grant SA 933/1-1.
∗∗ Partially supported by the Center of Excellence programme of the EU under
contract number ICAI-CT-2000-70025. Parts of this work were done while the
author was visiting the Computer and Automation Research Institute of the
Hungarian Academy of Sciences, Center of Excellence, MTA SZTAKI, Budapest.
can be found from the current node, DFS backtracks to the most recently
visited node with unvisited neighbor(s) and continues there. Similar to BFS,
DFS has proved to be a useful tool, especially in artificial intelligence [177].
Another well-known application of DFS is in the linear-time algorithm for
finding strongly connected components [713].
Graph connectivity problems include Connected Components (CC), Bi-
connected Components (BCC) and Minimum Spanning Forest (MST/MSF).
In CC we are given a graph G = (V, E) and we are to find and enumerate
maximal subsets of the nodes of the graph in which there is a path between
every two nodes. In BCC, two nodes are in the same subset iff there are
two internally vertex-disjoint paths connecting them. In MST/MSF the objective is to
find a spanning tree of G (spanning forest if G is not connected) with a
minimum total edge weight. Both problems are central in network design;
the obvious applications are checking whether a communications network is
connected or designing a minimum cost network. Other applications for CC
include clustering, e.g., in computational biology [386] and MST can be used
to approximate the traveling salesman problem within a factor of 1.5 [201].
We use the standard model of external memory computation [755]:
There is a main memory of size M and an external memory consisting of
D disks. Data is moved in blocks of size B consecutive words. An I/O-
operation can move up to D blocks, one from each disk. Further details
about models for memory hierarchies can be found in Chapter 1. We will
usually describe the algorithms under the assumption D = 1. In the fi-
nal results, however, we will provide the I/O-bounds for general D ≥ 1 as
well. Furthermore, we shall frequently use the following notational short-
cuts: scan(x) := O(x/(D · B)), sort(x) := O(x/(D · B) · log_{M/B}(x/B)), and
perm(x) := O(min{x/D, sort(x)}).
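For concreteness, the shortcuts can be written down as plain functions; the
following helpers are illustrative only, report I/O counts up to the suppressed
constant factors, and use arbitrary placeholder values for the machine
parameters:

    import math

    def scan(x, D=1, B=2**10):
        return x / (D * B)

    def sort(x, D=1, B=2**10, M=2**20):
        # log to the base M/B of x/B; the max() guards tiny inputs
        return x / (D * B) * math.log(max(x / B, 2), M / B)

    def perm(x, D=1, B=2**10, M=2**20):
        return min(x / D, sort(x, D, B, M))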
Organization of the Chapter. We discuss external-memory algorithms for
all the problems listed above. In Sections 4.2 – 4.7 we cover graph traversal
problems (BFS, DFS, SSSP) and Sections 4.8 – 4.13 provide algorithms for
graph connectivity problems (CC, BCC, MSF).
4.2 Graph-Traversal Problems: BFS, DFS, SSSP
In the following sections we will consider the classical graph-traversal prob-
lems Breadth-First Search (BFS), Depth-First Search (DFS), and Single-
Source Shortest-Paths (SSSP). All these problems are well-understood in
internal memory (IM): BFS and DFS can be solved in O(|V | + |E|) time
[21], SSSP with nonnegative edge weights requires O(|V | · log |V | + |E|) time
[252, 317]. On more powerful machine models, SSSP can be solved even faster
[374, 724].
Most IM algorithms for BFS, DFS, and SSSP visit the vertices of the
input graph G in a one-by-one fashion; appropriate candidate nodes for the
next vertex to be visited are kept in some data-structure Q (a queue for BFS,
a stack for DFS, and a priority-queue for SSSP). After a vertex v is extracted
from Q, the adjacency list of v, i.e., the set of neighbors of v in G, is examined
in order to update Q: unvisited neighboring nodes are inserted into Q; the
priorities of nodes already in Q may be updated.
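This one-by-one template can be summarized as follows; the sketch is a
simplified in-memory version (for SSSP, the deque would be replaced by a
priority queue keyed by tentative distance, and the DFS variant shown here
marks nodes already when they are pushed, a common simplification):

    from collections import deque

    def traverse(adj, s, mode="bfs"):
        # Q is a FIFO queue for BFS and a stack for DFS.
        visited, Q, order = {s}, deque([s]), []
        pop = Q.popleft if mode == "bfs" else Q.pop
        while Q:
            v = pop()
            order.append(v)
            for w in adj[v]:            # examine the adjacency list of v
                if w not in visited:    # insert unvisited neighbors into Q
                    visited.add(w)
                    Q.append(w)
        return order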
The Key Problems. The short description above already contains the main
difficulties for I/O-efficient graph-traversal algorithms:
(a) Unstructured indexed access to adjacency lists.
(b) Remembering visited nodes.
(c) (The lack of) Decrease Key operations in external priority-queues.
Whether (a) is problematic or not depends on the sizes of the adjacency
lists; if a list contains k edges then it takes Θ(1 + k/B) I/Os to retrieve all
its edges. That is fine if k = Ω(B), but wasteful if k = O(1). In spite of
intensive research, so far there is no general solution for (a) on sparse graphs:
unless the input is known to have special properties (for example planarity),
virtually all EM graph-traversal algorithms require Θ(|V |) I/Os to access
adjacency lists. Hence, we will mainly focus on methods to avoid spending
one I/O for each edge on general graphs¹. However, there is recent progress
for BFS on arbitrary undirected graphs [542]; e.g., if |E| = O(|V|), the new
algorithm requires just O(|V|/√B + sort(|V|)) I/Os. While this is a major
step forward for BFS on undirected graphs, it is currently unclear whether
similar results can be achieved for undirected DFS/SSSP or BFS/DFS/SSSP
on general directed graphs.
Problem (b) can be partially overcome by solving the graph problems
in phases [192]: a dictionary DI of maximum capacity |DI| < M is kept
in internal memory; DI serves to remember visited nodes. Whenever the
capacity of DI is exhausted, the algorithms make a pass through the external
graph representation: all edges pointing to visited nodes are discarded, and
the remaining edges are compacted into new adjacency lists. Then DI is
emptied, and a new phase starts by visiting the next element of Q. This
phase-approach explored in [192] is most efficient if the quotient |V |/|DI| is
small²; O(|V|/|DI| · scan(|V| + |E|)) I/Os are needed in total to perform
all graph compactions. Additionally, O(|V | + |E|) operations are performed
on Q.
As for SSSP, problem (c) is less severe if (b) is resolved by the phase-
approach: instead of actually performing Decrease Key operations, several
priorities may be kept for each node in the external priority-queue; after a
node v is dequeued for the first time (with the smallest key) any further
appearance of v in Q will be ignored. In order to make this work, superfluous
¹ In contrast, the chapter by Toma and Zeh in this volume (Chapter 5) reviews
improved algorithms for special graph classes such as planar graphs.
² The chapters by Stefan Edelkamp (Chapter 11) and Rasmus Pagh (Chapter 2)
in this book provide more details about space-efficient data-structures.
elements still kept in the EM data structure of Q are marked obsolete right
before DI is emptied at the end of a phase; the marking can be done by
scanning Q.
Plugging-in the I/O-bounds for external queues, stacks, and priority-
queues as presented in Chapter 2 we obtain the following results:
Problem     Performance with the phase-approach [192]
BFS, DFS    O(|V| + |V|/M · scan(|V| + |E|)) I/Os
SSSP        O(|V| + |V|/M · scan(|V| + |E|) + sort(|E|)) I/Os
Another possibility is to solve (b) and (c) by applying extra bookkeeping
and extra data structures like the I/O-efficient tournament tree of Kumar and
Schwabe [485]. In that case the graph traversal can be done in one phase,
even if n ≫ M.
It turns out that the best known EM algorithms for undirected graphs are
simpler and/or more efficient than their respective counterparts for directed
graphs; due to the new algorithm of [542], the difference for BFS on sparse
graphs is currently as big as Ω(√B · log |V|).
Problems (b) and (c) usually disappear in the semi-external memory
(SEM) setting where it is assumed that M = c · |V | < |E| for some ap-
propriately chosen positive constant c: e.g., the SEM model may allow to
keep a boolean array for (b) in internal memory; similarly, a node priority
queue with Decrease Key operation for (c) could reside completely in IM.
Still, due to (a), the currently best EM/SEM algorithms for BFS, DFS
and SSSP require Ω(|V |+|E|/B) I/Os on general directed graphs. Taking into
consideration that naive applications of the best IM algorithms in external
memory cause O(|V | · log |V | + |E|) I/Os we see how little has been achieved
so far concerning general sparse directed graphs. On the other hand, it is per-
fectly unclear whether one can do significantly better at all: the best known
lower-bound is only Ω(perm(|V |) + |E|/B) I/Os (by the trivial reductions of
list ranking to BFS, DFS, and SSSP).
In the following we will present some one-pass approaches in more detail.
In Section 4.3 we concentrate on algorithms for undirected BFS. Section 4.4
introduces I/O-efficient Tournament Trees; their application to undirected
EM SSSP is discussed in Section 4.5. Finally, Section 4.6 provides traversal
algorithms for directed graphs.
4.3 Undirected Breadth-First Search
Our exposition of one-pass BFS algorithms for undirected graphs is struc-
tured as follows. After some preliminary observations we review the basic
BFS algorithm of Munagala and Ranade [567] in Section 4.3.1. Then, in
Section 4.3.2, we present the recent improvement by Mehlhorn and Meyer
[542]. This algorithm can be seen as a refined implementation of the Muna-
gala/Ranade approach. The new algorithm clearly outperforms a previous
BFS approach [550], which was the first to achieve o(|V |) I/Os on undi-
rected sparse graphs with bounded node degrees. A tricky combination of
both strategies might help to solve directed BFS with sublinear I/O; see Sec-
tion 4.7 for more details.
We restrict our attention to computing the BFS level of each node v,
i.e., the minimum number of edges needed to reach v from the source. For
undirected graphs, the respective BFS tree or the BFS numbers (order of the
nodes in a level) can be obtained efficiently: in [164] it is shown that each
of the following transformations can be done using O(sort(|V | + |E|)) I/Os:
BFS Numbers → BFS Tree → BFS Levels → BFS Numbers.
The conversion BFS Numbers → BFS Tree is done as follows: for each
node v ∈ V, the parent of v in the BFS tree is the node v′ with BFS number
bfsnum(v′) = min_{(v,w)∈E} bfsnum(w). The adjacency lists can be augmented
with the BFS numbers by sorting. Another scan suffices to extract the BFS
tree edges.
As for the conversion BFS Tree → BFS Levels, an Euler tour [215] around
the undirected BFS tree can be constructed and processed using scanning and
list ranking; Euler tour edges directed towards the leaves are assigned a weight
+1 whereas edges pointing towards the root get weight −1. A subsequent
prefix-sum computation [192] on the weights of the Euler tour yields the
appropriate levels.
The last transformation BFS Levels → BFS Numbers proceeds level-by-
level: having computed correct numbers for level i, the order (BFS numbers)
of the nodes in level i + 1 is given as follows: each level-(i + 1) node v must be
a child (in the BFS tree) of its adjacent level-i node with least BFS number.
After sorting the nodes of level i and the edges between levels i and i + 1,
a scan provides the adjacency lists of level-(i + 1) nodes with the required
information.
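The conversion BFS Tree → BFS Levels, for instance, can be mimicked in main
memory as follows; the explicit stack merely walks the Euler tour, the running
counter plays the prefix sum over the +1/−1 weights, and the EM version
would use list ranking instead:

    def bfs_levels_from_tree(children, root):
        # Edges toward the leaves weigh +1, edges toward the root -1;
        # the prefix sum on first entering a vertex is its level.
        level, prefix = {root: 0}, 0
        stack = [(root, iter(children.get(root, ())))]
        while stack:
            v, it = stack[-1]
            child = next(it, None)
            if child is None:
                stack.pop()
                prefix -= 1            # tour edge toward the root
            else:
                prefix += 1            # tour edge toward the leaves
                level[child] = prefix
                stack.append((child, iter(children.get(child, ()))))
        return level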
4.3.1 The Algorithm of Munagala and Ranade
We turn to the basic BFS algorithm of Munagala and Ranade [567], MR BFS
for short. It is also used as a subroutine in more recent BFS approaches [542,
550]. Furthermore, MR BFS is applied in the deterministic CC algorithm of
[567] (which we discuss in Section 4.9).
Let L(t) denote the set of nodes in BFS level t, and let |L(t)| be the num-
ber of nodes in L(t). MR BFS builds L(t) as follows: let A(t) := N (L(t − 1))
Fig. 4.1. A phase in the BFS algorithm of Munagala and Ranade [567]. Level L(t)
is composed out of the disjoint neighbor vertices of level L(t − 1) excluding those
vertices already existing in either L(t − 2) or L(t − 1).
be the multi-set of neighbor vertices of nodes in L(t − 1); N (L(t − 1)) is
created by |L(t − 1)| accesses to the adjacency lists, one for each node in
L(t − 1). Since the graph is stored in adjacency-list representation, this takes
O(|L(t − 1)| + |N(L(t − 1))|/B) I/Os. Then the algorithm removes dupli-
cates from the multi-set A(t). This can be done by sorting A(t) according to
the node indices, followed by a scan and compaction phase; hence, the du-
plicate elimination takes O(sort(|A(t)|)) I/Os. The resulting set A′(t) is still
sorted.
Now the algorithm computes L(t) := A′(t) \ (L(t − 1) ∪ L(t − 2)). Fig. 4.1
provides an example. Filtering out the nodes already contained in the sorted
lists L(t − 1) or L(t − 2) is possible by parallel scanning. Therefore, this step
can be done using O(sort(|N(L(t − 1))|) + scan(|L(t − 1)| + |L(t − 2)|)) I/Os.
Since Σ_t |N(L(t))| = O(|E|) and Σ_t |L(t)| = O(|V|), the whole execution of
MR BFS requires O(|V| + sort(|E|)) I/Os.
The correctness of this BFS algorithm crucially depends on the fact that
the input graph is undirected. Assume that the levels L(0), . . . , L(t − 1) have
already been computed correctly. We consider a neighbor v of a node u ∈
L(t − 1): the distance from s to v is at least t − 2 because otherwise the
distance of u would be less than t − 1. Thus v ∈ L(t − 2) ∪ L(t − 1) ∪ L(t) and
hence it is correct to assign precisely the nodes in A′(t) \ (L(t − 1) ∪ L(t − 2))
to L(t).
Theorem 4.1 ([567]). BFS on arbitrary undirected graphs can be solved
using O(|V | + sort(|V | + |E|)) I/Os.
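The level-building loop of MR BFS translates almost literally into the
following in-memory sketch, where sets and sorting stand in for the external
sort and parallel-scan steps:

    def mr_bfs_levels(adj, s):
        # Build A(t) = N(L(t-1)), remove duplicates, then subtract
        # L(t-1) and L(t-2).
        L_prev2, L_prev1 = set(), {s}
        levels = [[s]]
        while L_prev1:
            A = sorted({w for v in L_prev1 for w in adj[v]})  # deduplicated
            L_t = [w for w in A if w not in L_prev1 and w not in L_prev2]
            if L_t:
                levels.append(L_t)
            L_prev2, L_prev1 = L_prev1, set(L_t)
        return levels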
4.3.2 An Improved BFS Algorithm
The Fast BFS algorithm of Mehlhorn and Meyer [542] refines the approach
of Munagala and Ranade [567]. It trades off unstructured I/Os against an
increased number of iterations in which an edge may be involved. Fast BFS
operates in two phases: in a first phase it preprocesses the graph and in a
second phase it performs BFS using the information gathered in the first
phase. We first sketch a variant with a randomized preprocessing. Then we
outline a deterministic version.
The Randomized Partitioning Phase. The preprocessing step parti-
tions the graph into disjoint connected subgraphs Si , 0 ≤ i ≤ K, with
small expected diameter. It also partitions the adjacency lists accordingly,
i.e., it constructs an external file F = F0 F1 . . . Fi . . . FK−1 where Fi con-
tains the adjacency lists of all nodes in Si . The partition is built by choos-
ing master nodes independently and uniformly at random with probability
µ = min{1, √((|V| + |E|)/(B · |V|))} and running a local BFS from all mas-
ter nodes “in parallel” (for technical reasons, the source node s is made the
master node of S0 ): in each round, each master node si tries to capture all un-
visited neighbors of its current sub-graph Si ; this is done by first sorting the
nodes of the active fringes of all Si (the nodes that have been captured in the
previous round) and then scanning the dynamically shrinking adjacency-lists
representation of the yet unexplored graph. If several master nodes want to
include a certain node v into their partitions then an arbitrary master node
among them succeeds. The selection can be done by sorting and scanning the
created set of neighbor nodes.
The expected number of master nodes is K := O(1 + µ · n) and the
expected shortest-path distance (number of edges) between any two nodes of
a subgraph is at most 2/µ. Hence, the expected total amount of data being
scanned from the adjacency-lists representation during the “parallel partition
growing” is bounded by
X := O(Σ_{v∈V} 1/µ · (1 + degree(v))) = O((|V| + |E|)/µ).
The total number of fringe nodes and neighbor nodes sorted and scanned dur-
ing the partitioning is at most Y := O(|V | + |E|). Therefore, the partitioning
requires
O(scan(X) + sort(Y )) = O(scan(|V | + |E|)/µ + sort(|V | + |E|))
expected I/Os.
After the partitioning phase each node knows the (index of the) sub-
graph to which it belongs. With a constant number of sort and scan
operations Fast BFS can reorganize the adjacency lists into the format
F0 F1 . . . Fi . . . F|S|−1 , where Fi contains the adjacency lists of the nodes in
partition Si ; an entry (v, w, S(w), fS(w) ) from the adjacency list of v ∈ Fi
stands for the edge (v, w) and provides the additional information that w be-
longs to subgraph S(w) whose subfile FS(w) starts at position fS(w) within F .
The edge entries of each Fi are lexicographically sorted. In total, F occupies
O((|V | + |E|)/B) blocks of external storage.
The BFS Phase. In the second phase the algorithm performs BFS as de-
scribed by Munagala and Ranade (Section 4.3.1) with one crucial difference:
Fast BFS maintains an external file H (= hot adjacency lists); it comprises
unused parts of subfiles Fi that contain a node in the current level L(t − 1).
Fast BFS initializes H with F0 . Thus, initially, H contains the adjacency
list of the root node s of level L(0). The nodes of each created BFS level will
also carry identifiers for the subfiles Fi of their respective subgraphs Si .
When creating level L(t) based on L(t − 1) and L(t − 2), Fast BFS does
not access single adjacency lists like MR BFS does. Instead, it performs a
parallel scan of the sorted lists L(t − 1) and H and extracts N(L(t − 1)).
In order to maintain the invariant that H contains the adjacency lists of all
vertices on the current level, the subfiles Fi of nodes whose adjacency lists
are not yet included in H will be merged with H. This can be done by first
sorting the respective subfiles and then merging the sorted set with H using
one scan. Each subfile Fi is added to H at most once. After an adjacency list
was copied to H, it will be used only for O(1/µ) expected steps; afterwards it
can be discarded from H. Thus, the expected total data volume for scanning
H is O(1/µ · (|V| + |E|)), and the expected total number of I/Os to handle H
and Fi is O(µ · |V| + sort(|V| + |E|) + 1/µ · scan(|V| + |E|)). The final result
follows with µ = min{1, √(scan(|V| + |E|)/|V|)}.
Theorem 4.2 ([542]). External memory BFS on undirected graphs can be
solved using O(√(|V| · scan(|V| + |E|)) + sort(|V| + |E|)) expected I/Os.
The Deterministic Variant. In order to obtain the result of Theorem 4.2
in the worst case, too, it is sufficient to modify the preprocessing phase of
Section 4.3.2 as follows: instead of growing subgraphs around randomly se-
lected master nodes, the deterministic variant extracts the subfiles Fi from
an Euler tour [215] around a spanning tree for the connected component Cs
that contains the source node s. Observe that Cs can be obtained with the
deterministic connected-components algorithm of [567] using
O((1 + log log(B · |V|/|E|)) · sort(|V| + |E|)) =
O(√(|V| · scan(|V| + |E|)) + sort(|V| + |E|)) I/Os. The same number of I/Os
suffices to compute a (minimum) spanning tree Ts for Cs [60].
After Ts has been built, the preprocessing constructs an Euler tour around
Ts using a constant number of sort- and scan-steps [192]. Then the tour is
broken at the root node s; the elements of the resulting list can be stored in
consecutive order using the deterministic list ranking algorithm of [192]. This
takes O(sort(|V |)) I/Os. Subsequently, the Euler tour can be cut into pieces
of size 2/µ in a single scan. These Euler tour pieces account for subgraphs Si
with the property that the distance between any two nodes of Si in G is at
most 2/µ − 1. See Fig. 4.2 for an example. Observe that a node v of degree d
may be part of Θ(d) different subgraphs Si . However, with a constant number
of sorting steps it is possible to remove multiple node appearances and make
sure that each node of Cs is part of exactly one subgraph Si (actually there
Fig. 4.2. Using an Euler tour around a spanning tree of the input graph in order
to obtain a partition for the deterministic BFS algorithm.
are special algorithms for duplicate elimination, e.g. [1, 534]). Eventually,
the reduced subgraphs Si are used to create the reordered adjacency-list
files Fi ; this is done as in the randomized preprocessing and takes another
O(sort(|V | + |E|)) I/Os. Note that the reduced subgraphs Si may not be
connected any more; however, this does not matter as our approach only
requires that any two nodes in a subgraph are relatively close in the original
input graph.
The BFS-phase of the algorithm remains unchanged; the modified pre-
processing, however, guarantees that each adjacency-list will be part of the
external set H for at most 2/µ BFS levels: if a subfile Fi is merged with
H for BFS level L(t), then the BFS level of any node v in Si is at most
L(t) + 2/µ − 1. Therefore, the adjacency list of v in Fi will be kept in H for
at most 2/µ BFS levels.
Theorem 4.3 ([542]). External memory BFS on undirected graphs can be
solved using O(√(|V| · scan(|V| + |E|)) + sort(|V| + |E|)) I/Os in the worst
case.
4.4 I/O-Efficient Tournament Trees
In this section we review a data structure due to Kumar and Schwabe [485]
which proved helpful in the design of better EM graph algorithms: the I/O-
efficient tournament tree, I/O-TT for short. A tournament tree is a complete
binary tree, where some rightmost leaves may be missing. In a figurative
sense, a standard tournament tree models the outcome of a k-phase knock-
out game between |V| ≤ 2^k players, where player i is associated with the i-th
leaf of the tree; winners move up in the tree.
The I/O-TT as described in [485] is more powerful: it works as a priority
queue with the Decrease Key operation. However, both the size of the data
structure and the I/O-bounds for the priority queue operations depend on
the size of the universe from which the entries are drawn. Used in connection
with graph algorithms, the static I/O-TT can host at most |V | elements with
Fig. 4.3. Principle of an I/O-efficient tournament tree. Signals are traveling from
the root to the leaves; elements move in opposite direction.
pairwise disjoint indices in {1, . . . , |V|}. Besides its index x, each element also
has a key k (priority). An element ⟨x1, k1⟩ is called smaller than ⟨x2, k2⟩ if
k1 < k2.
The I/O-TT supports the following operations:
(i) deletemin: extract the element ⟨x, k⟩ with smallest key k and replace it
by the new entry ⟨x, ∞⟩.
(ii) delete(x): replace ⟨x, oldkey⟩ by ⟨x, ∞⟩.
(iii) update(x, newkey): replace ⟨x, oldkey⟩ by ⟨x, newkey⟩ if newkey < oldkey.
Note that (ii) and (iii) do not require the old key to be known. This feature
will help to implement the graph-traversal algorithms of Section 4.5 without
paying one I/O for each edge (for example an SSSP algorithm does not have to
find out explicitly whether an edge relaxation leads to an improved tentative
distance).
Similar to other I/O-efficient priority queue data structures (see Chapter 2
of Rasmus Pagh for an overview) I/O-TTs rely on the concept of lazy batched
processing. Let M′ = c · M for some positive constant c < 1; the static
I/O-TT for |V| entries only has |V|/M′ leaves (instead of |V| leaves in the
standard tournament tree). Hence, there are O(log_2(|V|/M′)) levels. Elements
with indices in the range {(i − 1) · M′ + 1, . . . , i · M′} are mapped to the i-
th leaf. The index range of internal nodes of the I/O-TT is given by the
72 Irit Katriel and Ulrich Meyer
union of the index ranges of their children. Internal nodes of the I/O-TT
keep a list of at least M′/2 and at most M′ elements each (sorted according
to their priorities). If the list of a tree node v contains z elements, then
they are the smallest z out of all those elements in the tree being mapped
to the leaves that are descendants of v. Furthermore, each internal node is
equipped with a signal buffer of size M′. Initially, the I/O-TT stores the
elements ⟨1, +∞⟩, ⟨2, +∞⟩, . . . , ⟨|V|, +∞⟩, out of which the lists of internal
nodes keep at least M′/2 elements each. Fig. 4.3 illustrates the principle of
an I/O-TT.
4.4.1 Implementation of the Operations
The operations (i)–(iii) generate signals which serve to propagate information
down the tree; signals are inserted into the root node, which is kept in internal
memory. When a signal arrives in a node it may create, delete or modify an
element kept in this node; the signal itself may be discarded, altered or remain
unchanged. Non-discarded signals are stored until the capacity of the node’s
buffer is exceeded; then they are sent down the tree towards the unique leaf
node its associated element is mapped to.
Operation (i) removes an element ⟨x, k⟩ with smallest key k from the root
node (in case there are no elements in the root it is recursively refilled from
its children). A signal is sent on the path towards the leaf associated with
x in order to reinsert ⟨x, +∞⟩. The reinsertion takes place at the first tree
node on the path with free capacity whose descendants are either empty or
exclusively host elements with key infinity.
Operation (ii) is done in a similar way as (i); a delete signal is sent towards
the leaf node that index x is mapped to; the signal will eventually meet the
element ⟨x, oldkey⟩ and cause its deletion. Subsequently, the delete signal is
converted into a signal to reinsert ⟨x, +∞⟩ and proceeds as in case (i).
Finally, the signal for operation (iii) traverses its predefined tree path until
either some node v_newkey with appropriate key range is found or the element
⟨x, oldkey⟩ is met in some node v_oldkey. In the latter case, if oldkey > newkey
then ⟨x, newkey⟩ will replace ⟨x, oldkey⟩ in the list of v_oldkey; if oldkey ≤
newkey nothing changes. Otherwise, i.e., if ⟨x, newkey⟩ belongs to a tree node
v_newkey closer to the root than v_oldkey, then ⟨x, newkey⟩ will be added to the
list of v_newkey (in case this exceeds the capacity of v_newkey then the largest list
element is recursively flushed to the respective child node of v_newkey using a
special flush signal). The update signal for update(x, newkey) is altered into a
delete signal for ⟨x, oldkey⟩, which is sent down the tree in order to eliminate
the obsolete entry for x with the old key.
It can be observed that each operation from (i)–(iii) causes at most two
signals to travel all the way down to a leaf node. Overflowing buffers with
X > M′ signals can be emptied using O(X/B) I/Os. Elements moving up
the tree can be charged to signals traveling down the tree. Arguing more
formally along these lines, the following amortized bound can be shown:
Theorem 4.4 ([485]). On an I/O-efficient tournament tree with |V | ele-
ments, any sequence of z delete/deletemin/update operations requires at most
O(z/B · log_2(|V|/B)) I/Os.
4.5 Undirected SSSP with Tournament Trees
In the following we sketch how the I/O-efficient tournament tree of Section 4.4
can be used in order to obtain improved EM algorithms for the single source
shortest path problem. The basic idea is to replace the data structure Q for
the candidate nodes of IM traversal-algorithms (Section 4.2) by the EM
tournament tree. The resulting SSSP algorithm works for undirected graphs
with strictly positive edge weights.
The SSSP algorithm of [485] constructs an I/O-TT for the |V | vertices
of the graph and sets all keys to infinity. Then the key of the source node
is updated to zero. Subsequently, the algorithm operates in |V | iterations
similarly to Dijkstra’s approach [252]: iteration i first performs a deletemin
operation in order to extract an element ⟨vi, ki⟩; the final distance of the
extracted node vi is given by dist(vi) = ki. Then the algorithm issues
update(wj, dist(vi) + c(vi, wj)) operations on the I/O-TT for each adjacent
edge (vi, wj), vi ≠ wj, having weight c(vi, wj); in case of improvements the
new tentative distances will automatically materialize in the I/O-TT.
However, there is a problem with this simple approach; consider an edge
(u, v) where dist(u) < dist(v). By the time v is extracted from the I/O-TT,
u is already settled; in particular, after removing u, the I/O-TT replaces the
extracted entry ⟨u, dist(u)⟩ by ⟨u, +∞⟩. Thus, performing update(u, dist(v) +
c(v, u)) for the edge (v, u) after the extraction of v would reinsert the
settled node u into the set Q of candidate nodes. In the following we sketch
how this problem can be circumvented:
A second EM priority-queue³, denoted by SPQ, supporting a sequence of
z deletemin and insert operations with (amortized) O(z/B · log_2(z/B)) I/Os
is used in order to remember settled nodes “at the right time”. Initially, SPQ
is empty. At the beginning of iteration i, the modified algorithm additionally
checks the smallest element ⟨u′i, k′i⟩ from SPQ and compares its key k′i with
the key ki of the smallest element ⟨ui, ki⟩ in the I/O-TT. Subsequently, only
the element with smaller key is extracted (in case of a tie, the element in
the I/O-TT is processed first). If ki < k′i then the algorithm proceeds as
described above; however, for each update(v, dist(u) + c(u, v)) on the I/O-TT
it additionally inserts ⟨u, dist(u) + c(u, v)⟩ into the SPQ. On the other hand,
if k′i < ki then a delete(u′i) operation is performed on the I/O-TT as well and a
new phase starts.
³ Several priority queue data structures are appropriate; see Chapter 2 for an
overview.
Operation                          I/O-TT                    SPQ
...                                ⟨u, dist(u)⟩, ⟨v, ∗⟩
u = TT_deletemin()                 ⟨v, ∗⟩
TT_update(v, dist(u) + c(u, v))    ⟨v, ∗⟩
SPQ_insert(u, dist(u) + c(u, v))   ⟨v, ∗⟩                    ⟨u, dist(u) + c(u, v)⟩
...                                ⟨v, dist(v)⟩              ⟨u, dist(u) + c(u, v)⟩
v = TT_deletemin()                                           ⟨u, dist(u) + c(u, v)⟩
TT_update(u, dist(v) + c(u, v))    ⟨u, dist(v) + c(u, v)⟩    ⟨u, dist(u) + c(u, v)⟩
...                                ⟨u, dist(v) + c(u, v)⟩    ⟨u, dist(u) + c(u, v)⟩
u = SPQ_deletemin()                ⟨u, dist(v) + c(u, v)⟩
TT_delete(u)
Fig. 4.4. Identifying spurious entries in the I/O-TT with the help of a second
priority queue SPQ.
In Fig. 4.4 we demonstrate the effect for the previously stated problem
concerning an edge (u, v) with dist(u) < dist(v): after node u is extracted
from the I/O-TT for the first time, ⟨u, dist(u) + c(u, v)⟩ is inserted into SPQ.
Since dist(u) < dist(v) ≤ dist(u) + c(u, v), node v will be extracted from the
I/O-TT while u is still in SPQ. The extraction of v triggers a spurious reinsertion
of u into the I/O-TT having key dist(v) + c(v, u) = dist(v) + c(u, v) > dist(u) +
c(u, v). Thus, u is extracted as the smallest element in SPQ before the re-
inserted node u becomes the smallest element in the I/O-TT; as a consequence,
the resulting delete(u) operation for the I/O-TT eliminates the spurious node u
in the I/O-TT just in time. Extra rules apply for nodes with identical shortest
path distances.
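The interplay of the two priority queues can be simulated with binary heaps;
the sketch below mirrors only the control flow just described (names are ours,
the I/O behavior is of course not modeled, and edge weights are assumed
positive):

    import heapq

    def tt_sssp(adj, s, n):
        # adj[v]: list of (w, c) pairs; two heaps play the roles of the
        # I/O-TT and the SPQ.
        INF = float("inf")
        TT = [(0 if v == s else INF, v) for v in range(n)]
        heapq.heapify(TT)
        SPQ, dist, to_delete = [], {}, set()
        while TT:
            if SPQ and SPQ[0][0] < TT[0][0]:       # SPQ minimum is smaller:
                _, u = heapq.heappop(SPQ)          # u was reinserted
                to_delete.add(u)                   # spuriously, so issue
                continue                           # delete(u) on the TT
            k, v = heapq.heappop(TT)
            if v in to_delete or v in dist or k == INF:
                to_delete.discard(v)               # lazy deletion
                continue
            dist[v] = k                            # v is settled
            for w, c in adj[v]:
                if w not in dist:
                    heapq.heappush(TT, (k + c, w))   # update(w, dist(v)+c)
                heapq.heappush(SPQ, (k + c, v))      # remember settled v
        return dist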
As already indicated in Section 4.2, one-pass algorithms like the one just
presented still require Θ(|V |+(|V |+|E|)/B) I/Os for accessing the adjacency
lists. However, the remaining operations are more I/O-efficient: O(|E|) op-
erations on the I/O-TT and SPQ add another O(|E|/B · log_2(|E|/B)) I/Os.
Altogether this amounts to O(|V| + |E|/B · log_2(|E|/B)) I/Os.
Theorem 4.5. SSSP on undirected graphs can be solved using
O(|V| + |E|/B · log_2(|E|/B)) I/Os.
The unpublished full version of [485] also provides a one-pass EM algo-
rithm for DFS on undirected graphs. It requires O((|V| + |E|/B) · log_2 |V|)
I/Os. A different algorithm for directed graphs achieving the same bound will
be sketched in the next section.
4.6 Graph-Traversal in Directed Graphs
The best known one-pass traversal-algorithms for general directed graphs are
often less efficient and less appealing than their undirected counterparts from
the previous sections. The key difference is that it becomes much more com-
plicated to keep track of previously visited nodes of the graph; the nice trick of
checking a constant number of previous levels for visited nodes as discussed
for undirected BFS does not work for directed graphs. Therefore we store
edges that point to previously seen nodes in a so-called buffered repository
tree (BRT) [164]: A BRT maintains |E| elements with keys in {1, . . . , |V |}
and supports the operations insert(edge, key) and extract all(key); the latter
operation reports and deletes all edges in the BRT that are associated with
the specified key.
A BRT can be built as a height-balanced static binary tree with |V |
leaves and buffers of size B for each internal tree node; leaf i is associated
with graph node vi and stores up to degree(vi ) edges. Insertions into the
BRT happen at the root; in case of buffer overflow an inserted element (e, i)
is flushed down towards the i-th leaf. Thus, an insert operation requires
amortized O(1/B · log2 |V |) I/Os. If extract all (i) reports x edges then it
needs to read O(log2 |V |) buffers on the path from the root to the i-th leaf;
another O(x/B) disk blocks may have to be read at the leaf itself. This
accounts for O(x/B + log2 |V |) I/Os.
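
To make these mechanics concrete, the following toy model mimics a BRT in memory
(Python; plain lists stand in for the disk-resident buffers of size B, and all
identifiers are ours rather than those of [164]):

class BRT:
    """Toy in-memory model of a buffered repository tree (BRT).

    Keys are integers in [0, n); every internal node (identified by the
    key range it covers) owns a buffer of up to block_size items.  In a
    real BRT each buffer lives in disk blocks; plain lists stand in for
    them here so that only the routing logic remains visible.
    """

    def __init__(self, n, block_size):
        self.n, self.B = n, block_size
        self.buf = {}                        # (lo, hi) -> list of (item, key)
        self.leaf = [[] for _ in range(n)]   # leaf i collects items with key i

    def insert(self, item, key):
        self._push((0, self.n), item, key)   # insertions enter at the root

    def _push(self, node, item, key):
        lo, hi = node
        if hi - lo == 1:                     # reached leaf `lo`
            self.leaf[lo].append(item)
            return
        buf = self.buf.setdefault(node, [])
        buf.append((item, key))
        if len(buf) > self.B:                # buffer overflow: flush one level down
            self.buf[node] = []
            mid = (lo + hi) // 2
            for it, k in buf:
                self._push((lo, mid) if k < mid else (mid, hi), it, k)

    def extract_all(self, key):
        lo, hi = 0, self.n
        while hi - lo > 1:                   # inspect buffers on the root-to-leaf path
            mid = (lo + hi) // 2
            keep = []
            for it, k in self.buf.get((lo, hi), []):
                if k == key:
                    self.leaf[key].append(it)
                else:
                    keep.append((it, k))
            self.buf[(lo, hi)] = keep
            lo, hi = (lo, mid) if key < mid else (mid, hi)
        out, self.leaf[key] = self.leaf[key], []
        return out

In a genuine external-memory implementation every buffer occupies whole disk blocks
and a flush moves one block per tree level, which is exactly where the amortized
O(1/B · log2 |V|) insertion bound comes from.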
For DFS, an external stack S is used to store the vertices on the path
from the root node of the DFS tree to the currently visited vertex. A step of
the DFS algorithm checks the previously unexplored outgoing edges of the
topmost vertex u from S. If the target node v of such an edge (u, v) has not
been visited before then u is the father of v in the DFS tree. In that case,
v is pushed on the stack and the search continues for v. Otherwise, i.e., if v
has already been visited before, the next unexplored outgoing edge of u will
be checked. Once all outgoing edges of the topmost node u on the stack have
been checked, node u is popped and the algorithm continues with the new
topmost node on the stack.
Using the BRT the DFS procedure above can be implemented I/O effi-
ciently as follows: when a node v is encountered for the first time, then for
each incoming edge ei = (ui , v) the algorithm performs insert (ei , ui ). If at
some later point ui is visited then extract all (ui ) provides a list of all edges
out of ui that should not be traversed again (since they lead to nodes al-
ready seen before). If the (ordered) adjacency list of ui is kept in some EM
priority-queue P (ui ) then all superfluous edges can be deleted from P (ui )
in an I/O-efficient way. Subsequently, the next edge to follow is given by
extracting the minimum element from P (ui ).
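
A minimal in-memory sketch of this procedure might look as follows; binary heaps
stand in for the external buffer trees P(·), brt is the toy structure sketched
above, and the identifiers are ours:

import heapq
from collections import defaultdict

def directed_dfs(adj_out, adj_in, root, brt):
    """Sketch of the BRT-based DFS for directed graphs (after [164]).

    adj_out / adj_in list every vertex's out- and in-neighbors; binary
    heaps stand in for the external buffer trees P(v), and brt is the
    toy BRT sketched above.  Returns the vertices in DFS preorder.
    """
    visited, order, P = set(), [], {}
    dead = defaultdict(set)       # dead[u]: targets of superfluous out-edges

    def discover(v):
        visited.add(v)
        order.append(v)
        P[v] = list(adj_out[v])
        heapq.heapify(P[v])
        for u in adj_in[v]:       # edges (u, v) must never be traversed again
            brt.insert((u, v), u)

    discover(root)
    stack = [root]
    while stack:
        u = stack[-1]
        dead[u].update(v for (_, v) in brt.extract_all(u))
        while P[u] and P[u][0] in dead[u]:
            heapq.heappop(P[u])   # delete superfluous edges from P(u)
        if not P[u]:
            stack.pop()           # all out-edges of u have been checked
        else:
            v = heapq.heappop(P[u])  # next edge to follow: minimum of P(u)
            discover(v)
            stack.append(v)
    return order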
The algorithm takes O(|V| + |E|/B) I/Os to access adjacency lists. There
are O(|E|) operations on the |V| priority queues P(·) (implemented as external
buffer trees). As the DFS algorithm traverses the DFS tree along tree edges,
it needs to change between different P(·) at most O(|V|) times.
Therefore, O(|V| + sort(|E|)) I/Os suffice to handle the operations on all
P(·). Additionally, there are O(|E|) insert and O(|V|) extract all operations
on the BRT; the I/Os required for them add up to O((|V | + |E|/B) · log2 |V |)
I/Os.
The algorithm for BFS works similarly, except that the stack is replaced
by an external queue.
Theorem 4.6 ([164, 485]). BFS and DFS on directed graphs can be solved
using O((|V | + |E|/B) · log2 |V |) I/Os.
4.7 Conclusions and Open Problems for Graph Traversal
In the previous sections we presented the currently best EM traversal algo-
rithms for general graphs assuming a single disk. With small modifications,
the results of Table 4.1 can be obtained for D ≥ 1 disks.
Table 4.1. Graph traversal with D parallel disks.

Problem          I/O-Bound
Undir. BFS       O(√(|V| · scan(|V| + |E|)) + sort(|V| + |E|))
Dir. BFS, DFS    O(min{|V| + (|V|/M) · scan(|V| + |E|), (|V| + |E|/(D·B)) · log2 |V|})
Undir. SSSP      O(min{|V| + (|V|/M) · sort(|V| + |E|), |V| + (|E|/(D·B)) · log2 |V|})
Dir. SSSP        O(|V| + (|V|/M) · sort(|V| + |E|))
One of the central goals in EM graph-traversal is the reduction of un-
structured I/O for accessing adjacency-lists. The Fast BFS algorithm of
Section 4.3.2 provides a first solution for the case of undirected BFS. However,
it also raises new questions: for example, is Ω(|V|/√(D · B)) I/Os a lower
bound for sparse graphs? Can similar results be obtained for DFS or SSSP
with arbitrary nonnegative edge weights? In the following we shortly discuss
difficulties concerning possible extensions towards directed BFS.
The improved I/O-bound of Fast BFS stems from partitioning the undi-
rected input graph G into disjoint node sets Si such that the distance in G
between any two nodes of Si is relatively small. For directed graphs, a par-
titioning of that kind may not always exist. It could be beneficial to have a
larger number of overlapping node sets where the algorithm accesses just a
small fraction of them. Similar ideas underlie a previous BFS algorithm for
undirected graphs with small node degrees [550]. However, there are deep
problems concerning space blow-up and efficiently identifying the “right”
partitioning for arbitrary directed graphs. Besides all that, an o(|V |)-I/O
algorithm for directed BFS must also feature novel strategies to remember
previously visited nodes. Maybe, for the time being, this additional compli-
cation should be left aside by restricting attention to the semi-external case;
first results for semi-external BFS on directed Eulerian graphs are given in
[542].
4.8 Graph Connectivity: Undirected CC, BCC, and MSF
The Connected Components (CC) problem is to create, for a graph G =
(V, E), a list of the nodes of the graph, sorted by component, with a spe-
cial record marking the end of each component. The Connected Components
Labeling (CCL) problem is to create a vector L of size |V | such that for every
i, j ∈ {1, · · · , |V |}, L[i] = L[j] iff nodes i and j are in the same component
of G. A solution to CCL can be converted into a CC output in O(sort(|V |))
I/Os, by sorting the nodes according to their component label and adding
the separators.
Minimum Spanning Forest (MSF) is the problem of finding a subgraph F
of the input graph G such that every connected component in G is connected
in F and the total weight of the edges of F is minimal.
Biconnected Components (BCC) is the problem of finding subsets of the
nodes such that u and v are in the same subset iff there are two node-disjoint
paths between u and v.
CC and MSF are related, as an MSF yields the connected components
of the graph. Another common point is that typical algorithms for both
problems involve reading the adjacency lists of nodes. Reading a node’s ad-
jacency list L takes O(scan(|L|)) I/Os, so going over the nodes in an arbi-
trary order and reading each node’s adjacency list once requires a total of
O(scan(|E|) + |V |) I/Os. If |V | ≤ |E|/B, the scan(|E|) term dominates.
For both CC and MST, we will see algorithms that have this |V | term
in their complexity. Before using such an algorithm, a node reduction step
will be applied to reduce the number of nodes to at most |E|/B. Then,
the |V | term is dominated by the other terms in the complexity. A node
reduction step should have the property that the transformations it performs
on the graph preserve the solutions to the problem in question. It should
also be possible to reintegrate the nodes or edges that were removed into
the solution of the problem for the reduced graph. For both CC and MST,
the node reduction step will apply a small number of phases of an iterative
algorithm for the problem. We could repeat such phases until completion, but
that would require Θ (log log |V |) phases, each of which is quite expensive.
The combination of a small number, O(log log(|V |B/|E|)), of phases with
an efficient algorithm for dense graphs, gives a better complexity. Note that
when |V | ≤ |E|, O(sort(|E|) log log(|V |B/|E|)) = O(sort(|E|) log log B).
For BCC, we show an algorithm that transforms the graph and then
applies CC. BCC is clearly not easier than CC; given an input G = (V, E)
to CC, we can construct a graph G′ by adding a new node and connecting it
with each of the nodes of G. Each biconnected component of G′ is the union
of a connected component of G with the new node.
4.9 Connected Components
The undirected BFS algorithm of Section 4.3.1 can trivially be modified to
solve the CCL problem: whenever L(t−1) is empty, the algorithm has spanned
a complete component and selects some unvisited node for L(t). It can there-
fore number the components, and label each node in L(t) with the current
component number before L(t) is discarded. This adds another O(|V |) I/Os,
so the complexity is still O(|V | + sort(|V | + |E|)). For a dense graph, where
|V | ≤ |E|/B, we have O(|V | + sort(|V | + |E|)) = O(sort(|V | + |E|)). For a
general graph, Munagala and Ranade [567] suggest to precede the BFS run
with the following node reduction step:
The idea is to find, in each phase, sets of nodes such that the nodes in
each set belong to the same connected component. Among each set a leader
is selected, all edges between nodes of the set are contracted and the set is
reduced to its leader. The process is then repeated on the reduced graph.
Algorithm 1. Repeat until |V | ≤ |E|/B:
1. For each node, select a neighbor. Assuming that each node has a unique
integer node-id, let this be the neighbor with the smallest id. This par-
titions the nodes into pseudotrees (directed graphs where each node has
outdegree 1).
2. Select a leader from each pseudotree.
3. Replace each edge (u, v) in E by an edge (R(u), R(v)), where R(v) is the
leader of the pseudotree v belongs to.
4. Remove isolated nodes, parallel edges and self loops.
For Step 1 we sort two copies of the edges, one by source node and one by
target node, and scan both lists simultaneously to find the lowest numbered
neighbor of each node.
For Step 2 we note that a pseudotree is a tree with one additional edge.
Since each node selected its smallest neighbor, the single cycle of a
pseudotree consists of two nodes that selected each other, and the smaller of
the two is the smallest node of the entire pseudotree. This node can be
identified by the fact that the edge selected for it goes to a node with a
higher id which selected it in turn. Hence, we can scan the list of edges of
the pseudoforest, remove from each such mutually selecting pair the edge whose
source is smaller than its target, and obtain a forest while implicitly
selecting the node with the smallest id of each pseudotree as
the leader. On a forest, Step 3 can be implemented by pointer doubling as
in the algorithm for list ranking in Chapter 3 with O(sort(|E|)) I/Os. Step 4
requires an additional constant number of sorts and scans of the edges. The
total I/Os for one iteration is then O(sort(|E|)). Since each iteration at least
halves the number of nodes, log2 (|V |B/|E|) iterations are enough, for a total
of O(sort(|E|) log(|V |B/|E|)) I/Os.
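
The following sketch traces one such iteration in memory (Python). In external
memory every step becomes a constant number of sorts and scans, and Step 3 uses
the list-ranking routine of Chapter 3; the cycle of each pseudotree is broken at
its mutually selecting pair of nodes, which is the precise form of the
source-smaller-than-target rule. All names are ours:

def reduction_phase(edges):
    """One node-reduction phase of Algorithm 1, sketched in memory.

    `edges` is a list of undirected edges (u, v) over integer node ids.
    Returns the relabeled edge list and the leader map R.
    """
    # Step 1: every node selects its smallest neighbor.
    sel = {}
    for u, v in edges:
        sel[u] = min(sel.get(u, v), v)
        sel[v] = min(sel.get(v, u), u)
    # Step 2: each pseudotree cycle is a mutually selecting pair; its
    # smaller node (the pseudotree's smallest id) becomes the leader.
    parent = {u: (u if sel[sel[u]] == u and u < sel[u] else sel[u]) for u in sel}
    # Step 3: pointer doubling until every node knows its leader R(u)
    # (externally: the list-ranking technique of Chapter 3).
    done = False
    while not done:
        done = True
        for u in parent:
            if parent[u] != parent[parent[u]]:
                parent[u] = parent[parent[u]]
                done = False
    # Step 4: relabel edges and drop self loops and parallel edges.
    reduced = {tuple(sorted((parent[u], parent[v])))
               for u, v in edges if parent[u] != parent[v]}
    return sorted(reduced), parent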
After log2 (|V |B/|E|) iterations, we have a contracted graph in which each
node represents a set of nodes from the original graph. Applying BFS on the
contracted graph gives a component label to each supernode. We then need
to go over the nodes of the original graph and assign to each of them the
label of the supernode it was contracted to. This can be done by sorting the
list of nodes by the id of the supernode that the node was contracted to and
the list of component labels by supernode id, and then scanning both lists
simultaneously.
The complexity can be further improved by contracting more edges per phase
at the same cost. More precisely, in phase i up to √Si edges adjacent to each
node will be contracted, where Si = 2^((3/2)^i) = (Si−1)^(3/2) (fewer edges will
be contracted only if some nodes have become singletons, in which case they
become inactive). This means that the number of active nodes at the beginning
of phase i, |Vi|, is at most |V|/(Si−1 · Si−2) ≤ |V|/((Si)^(2/3) · (Si)^(4/9)) ≤
|V|/Si, and log log(|V|B/|E|) phases are sufficient to reduce the number of
nodes as desired.
To stay within the same complexity per phase, phase i is executed on a
reduced graph Gi, which contains only the relevant edges: those that will be
contracted in the current phase. Then |Ei| ≤ |Vi| · √Si. We will later see how
this helps, but first we describe the algorithm:
Algorithm 2. Phase i:
1. For each active node, select up to d = √Si adjacent edges (fewer if the
   node's degree is smaller). Generate a graph Gi = (Vi, Ei) over the active
   nodes with the selected edges.
2. Apply log d phases of Algorithm 1 to Gi.
3. Replace each edge (u, v) in E by an edge (R(u), R(v)) and remove
   redundant edges and nodes as in Algorithm 1.
Complexity Analysis. Steps 1 and 3 take O(sort(|E|)) I/Os as in Algorithm 1.
In Step 2, each phase of Algorithm 1 takes O(sort(|Ei|)) I/Os. With
|Ei| ≤ |Vi| · √Si (due to Step 1) and |Vi| ≤ |V|/Si (as shown above), this is
O(sort(|V|/√Si)), and all log d = log √Si phases need a total of O(sort(|V|))
I/Os. Hence, one phase of Algorithm 2 needs O(sort(|E|)) I/Os as before, giving
a total complexity of O(sort(|E|) log log(|V|B/|E|)) I/Os for the node
reduction. The BFS-based CC algorithm can then be executed in O(sort(|E|))
I/Os.
For D > 1, we perform node reduction phases until |V | ≤ |E|/BD, giving:
Theorem 4.7 ([567]). CC can be solved using
O(sort(|E|) · max{1, log log(|V |BD/|E|)}) I/Os.
4.10 Minimum Spanning Forest
An O(|V| + sort(|E|)) MSF algorithm. The Jarník-Prim [428, 615]
MSF algorithm builds the MSF incrementally, one tree at a time. It starts
with an arbitrary node and in each iteration, finds the node that is connected
to the current tree by the lightest edge, and adds it to the tree. When there
do not exist any more edges connecting the current tree with unvisited nodes,
it means that a connected component of the graph has been spanned. The
algorithm then selects an arbitrary unvisited node and repeats the process
to find the next tree of the forest. For this purpose, the nodes which are not
in the MSF are kept in a priority queue, where the priority of a node is the
weight of the lightest edge that connects it to the current MSF, and ∞ if
such an edge does not exist. Whenever a node v is added to the MSF, the
algorithm looks at v’s neighbors. For each neighbor u which is in the queue,
the algorithm compares the weight of the edge (u, v) with the priority of u. If
the priority is higher than the weight of the new edge that connects u to the
tree, u’s priority is updated. In EM, this algorithm requires at least one I/O
per edge to check the priority of a node; hence the number of I/Os is Θ(|E|).
However, Arge et al. [60] pointed out that if edges are kept in the queue
instead of nodes, the need for updating priorities is eliminated. During the
execution of the algorithm, the queue contains (at least) all edges that con-
nect nodes in the current MSF with nodes outside of it. It can also contain
edges between nodes that are in the MSF. The algorithm proceeds as fol-
lows: repeatedly perform extract minimum to extract the minimum weight
edge (u, v) from the queue. If v is already in the MSF, the edge is discarded.
Otherwise, v is included in the MSF and all edges incident to it, except for
(u, v), are inserted into the queue. Since each node in the tree inserts all its
adjacent edges, if v is in the MSF, the queue contains two copies of the edge
(u, v). Assuming that all edge weights are distinct, after the first copy was
extracted from the queue, the second copy is the new minimum. Therefore,
the algorithm can use one more extract-minimum operation to know whether
the edge should be discarded or not.
The algorithm reads each node's adjacency list once, with O(|V| + |E|/B)
I/Os. It also performs O(|E|) insert and extract minimum operations on the
queue. By using an external memory priority queue that supports these
operations in O((1/B) · log_{M/B}(N/B)) I/Os amortized, we get that the total
complexity is O(|V| + sort(|E|)).
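
The following in-memory sketch of this edge-based variant (heapq standing in
for the external priority queue; w(u, v) returns the distinct weight of edge
(u, v); names are ours) shows that no priority ever needs to be decreased:

import heapq

def msf_prim_edges(adj, w, n):
    """Sketch of the edge-based Jarnik-Prim variant of Arge et al. [60].

    adj[v] lists v's neighbors, w(u, v) returns the (distinct) weight of
    edge (u, v), and n is the number of nodes.  Returns the MSF edges.
    """
    in_msf = [False] * n
    msf = []
    for s in range(n):
        if in_msf[s]:
            continue
        in_msf[s] = True                      # start a new tree at s
        pq = [(w(s, v), s, v) for v in adj[s]]
        heapq.heapify(pq)
        while pq:
            wt, u, v = heapq.heappop(pq)
            if in_msf[v]:
                continue                      # second copy of an internal edge:
                                              # discard (one extra extract-min)
            in_msf[v] = True
            msf.append((u, v))
            for x in adj[v]:
                if x != u:                    # insert all edges incident to v,
                    heapq.heappush(pq, (w(v, x), v, x))   # except (u, v)
    return msf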
Node Reduction for MSF. As in the CC case, we wish to reduce the
number of nodes to at most |E|/B. The idea is to contract edges that are
in the MSF, merging the adjacent nodes into supernodes. The result is a
reduced graph G′ such that the MSF of the original graph is the union of
the contracted edges and the edges of the MSF of G′, which can be found
recursively.
A Boruvka [143] step selects for each node the minimum-weight edge in-
cident to it. It contracts all the selected edges, replacing each connected com-
ponent they define by a supernode that represents the component, removing
all isolated nodes, self-edges, and all but the lowest weight edge among each
set of multiple edges.
Each Boruvka step reduces the number of nodes by at least a factor of two,
while contracting only edges that belong to the MSF. After i steps, each
supernode represents a set of at least 2^i original nodes; hence the number of
supernodes is at most |V|/2^i. In order to reduce the number of nodes to |E|/B,
log(|V|B/|E|) phases are necessary. Since one phase requires O(sort(|E|))
I/Os, this algorithm has complexity O(sort(|E|) · max{1, log(|V|B/|E|)}).
As with the CC node reduction algorithm, this can be improved by com-
bining phases into superphases, where each superphase still needs
O(sort(|E|)) I/Os, and reduces more nodes than the basic step. Each
superphase is the same, except that the edges selected for Ei are not the
smallest-numbered √Si edges adjacent to each node, but the lightest √Si
edges. Then a superphase, which is equivalent to log √Si Boruvka steps, is
executed with O(sort(|E|)) I/Os.
tion is O(sort(|E|) · max{1, log log(|V |B/|E|)}). The output is the union of
the edges that were contracted in the node reduction phase and the MSF of
the reduced graph.
Theorem 4.8 ([567]). MSF can be solved using
O(sort(|E|) · max{1, log log(|V |BD/|E|)}) I/Os.
4.11 Randomized CC and MSF
We now turn to the randomized MSF algorithm that was proposed by Abello,
Buchsbaum and Westbrook [1], which is based on an internal memory algo-
rithm that was found by Karger, Klein and Tarjan [445]. As noted above, an
MSF algorithm is also a CC algorithm.
Let G = (V, E) be the input graph with edge weights w(e). Let E′ ⊆ E be
an arbitrary subset of the edges of G and let F be the MSF of G′ = (V, E′).
Denote by WF (u, v) the weight of the heaviest edge on the path connecting
u and v in F. If such a path does not exist, WF (u, v) = ∞.
Lemma 4.9. For any edge e = (u, v) ∈ E, if w(e) ≥ WF (u, v), then there
exists an MSF that does not include e.
The lemma follows immediately from the cycle property: the heaviest edge of
each cycle is not in the MSF.
The following MSF algorithm is based on Lemma 4.9 [445]:
1. Apply two Boruvka phases to reduce the number of nodes by at least a
   factor of 4. Call the contracted graph G′.
2. Choose a subgraph H of G′ by selecting each edge independently with
   probability p.
3. Apply the algorithm recursively to find the MSF F of H.
4. Delete from G′ each edge e = (u, v) for which w(e) > WF (u, v). Call the
   resulting graph G″.
5. Apply the algorithm recursively to find the MSF F′ of G″.
6. Return the union of the edges contracted in Step 1 and the edges of F′.
Step 1 requires O(sort(|E|)) I/Os and Step 2 an additional O(scan(|E|))
I/Os.
Step 4 can be done with O(sort(|E|)) I/Os as well [192]. The idea is based
on the MST verification algorithm of King [456], which in turn uses a
simplification of Komlós's algorithm for finding the heaviest edge on a
path in a tree [464]. Since Komlós's algorithm only works on full branching
trees⁴, the MSF F is first converted into a forest F′ of full branching trees
such that WF′ (u, v) = WF (u, v) and F′ has O(|V|) nodes. The conversion is
done as follows: each node of F is a leaf in F′. A sequence of Boruvka steps
is applied to F, and each step defines one level of each tree in F′: the nodes
at distance i from the leaves are the supernodes that exist after i Boruvka
steps, and the parent of each non-root node is the supernode into which it was
contracted. Since each contraction involves at least two nodes, the degree of
each non-leaf node is at least 2. All Boruvka steps require a total of
O(sort(|V|)) I/Os.
The next step is to calculate, for each edge (u, v) in G′, the least common
ancestor (LCA) of u and v in F′. Chiang et al. show that this can be done
with O(sort(|E|)) I/Os [192]. An additional O(sort(|E|)) I/Os are necessary
to construct a list of tuples, one for each edge of G′, containing the weight
of the edge, its endpoints, and the LCA of the endpoints. These tuples are then
filtered through F′ with O((|E|/|V|) · sort(|V|)) = O(sort(|E|)) I/Os as follows: each
tuple is sent as a query to the root of the tree and traverses down from it
towards the leaves that represent the endpoints of the edge. When the query
reaches the LCA, it is split into two weight-queries, one continuing towards
each of the endpoints. If a weight query traverses a tree edge which is heavier
than the query edge, the edge is not discarded. Otherwise, it is. To make this
I/O efficient, queries are passed along tree paths using batch filtering [345].
This means that instead of traversing the tree from root to leaves for each
query, a batch of |V | queries is handled at each node before moving on to the
next.
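
As a sanity check of what Step 4 computes, here is a naive in-memory filter
that evaluates WF (u, v) by a plain tree walk instead of the batch filtering
just described (Python; identifiers are ours):

def filter_heavy(forest_adj, w, edges):
    """In-memory sketch of Step 4: delete every edge e = (u, v) with
    w(e) > W_F(u, v); the strict inequality protects the edges of F
    itself.  forest_adj is F's adjacency structure and w(u, v) returns
    an edge's weight.
    """
    def heaviest_on_paths(u):
        # max edge weight on the F-path from u to every reachable vertex
        best, stack = {u: 0.0}, [u]
        while stack:
            x = stack.pop()
            for y in forest_adj.get(x, ()):
                if y not in best:
                    best[y] = max(best[x], w(x, y))
                    stack.append(y)
        return best

    kept = []
    for u, v in edges:
        reach = heaviest_on_paths(u)
        # W_F(u, v) = infinity if v is unreachable from u in F
        if v not in reach or w(u, v) <= reach[v]:
            kept.append((u, v))
    return kept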
The non-recursive stages then require O(sort(|E|)) I/Os. Karger et al.
have shown that the expected number of edges in G″ is |V|/p [445]. The
algorithm includes two recursive calls: one with at most |V|/4 nodes and
expected p|E| edges, and the other with at most |V|/4 nodes and expected
|V|/p edges. The total expected I/O complexity is then:
t(|E|, |V|) ≤ O(sort(|E|)) + t(p|E|, |V|/4) + t(|V|/p, |V|/4) = O(sort(|E|))

⁴ A full branching tree is a tree in which all leaves are at the same level and each
non-leaf node has at least two descendants.
Theorem 4.10 ([1]). MSF and CC can be solved by a randomized algorithm
that uses O(sort(|E|)) I/Os in the expected case.
4.12 Biconnected Components
Tarjan and Vishkin [714] propose a parallel algorithm that reduces BCC to
an instance of CC. The idea is to transform the graph G into a graph G′
such that the connected components of G′ correspond to the biconnected
components of G. Each node of G′ represents an edge of G. If two edges
e1 and e2 are on the same simple cycle in G, the edge (e1, e2) exists in G′.
The problem with this construction is that the graph G′ is very large; it has
|E| nodes and up to Ω(|E|²) edges. However, they show how to generate a
smaller graph G″ with the desired properties: instead of including all edges
of G′, first find (any) rooted spanning tree T of G and generate G″ as the
subgraph of G′ induced by the edges of T. Formally, we say that two nodes
of T are unrelated if neither is a predecessor of the other. G″ then includes
the edges:
1. {(u, v), (x, w)} such that (u, v), (x, w) ∈ T, (v, w) ∈ G − T and v, w are
   unrelated in T.
2. {(u, v), (v, w)} such that (u, v), (v, w) ∈ T and there exists an edge in G
   between a descendant of w and a non-descendant of v.
It is not difficult to see that the connected components of G″ correspond to
the biconnected components of G, and that the number of edges in G″ is O(|E|).
Chiang et al. [192] adapt this algorithm to the I/O model. Finding an arbi-
trary spanning tree is obviously not more difficult than finding an MST, which
can be done with O(sort(|E|) · max{1, log log(|V |BD/|E|)}) I/Os (Theorem
4.8). To construct G″, we can use an Euler tour and list ranking (Chapter 3) to
number the nodes by preorder and find the number of descendants of each
node, followed by a constant number of sorts and scans of the edges to check
the conditions described above. Hence, the construction of G″ after we have
found the MSF needs O(sort(|V|)) I/Os. Finding the connected components
of G″ has the same complexity as MSF (Theorem 4.7). Deriving the biconnected
components of G from the connected components of G″ can then be
done with a constant number of sorts and scans of the edges of G and G″.
Theorem 4.11 ([192]). BCC can be solved using
O(sort(|E|) · max{1, log log(|V |BD/|E|)}) I/Os.
Using the result of Theorem 4.10 we get:
Theorem 4.12. BCC can be solved by a randomized algorithm using
O(sort(|E|)) I/Os.
4.13 Conclusion for Graph Connectivity
Munagala and Ranade [567] prove a lower bound of Ω (|E|/|V | · sort(|V |))
I/Os for CC, BCC and MSF. Note that |E|/|V | · sort(|V |) = Θ (sort(|E|)).
We have surveyed randomized algorithms that achieve this bound, but the
best known deterministic algorithms have a slightly higher I/O complexity.
Therefore, while both deterministic and randomized algorithms are efficient,
there still exists a gap between the upper bound and the lower bound in the
deterministic case.
Acknowledgements. We would like to thank the participants of the GI-
Dagstuhl Forschungsseminar “Algorithms for Memory Hierarchies” for a
number of fruitful discussions and helpful comments on this chapter.
5. I/O-Efficient Algorithms for Sparse Graphs
Laura Toma and Norbert Zeh∗
5.1 Introduction
Massive graphs arise naturally in many applications. Recent web crawls, for
example, produce graphs with on the order of 200 million nodes and 2 billion
edges. Recent research in web modelling uses depth-first search, breadth-first
search, and the computation of shortest paths and connected components as
primitive routines for investigating the structure of the web [158]. Massive
graphs are also often manipulated in Geographic Information Systems (GIS),
where many problems can be formulated as fundamental graph problems.
When working with such massive data sets, only a fraction of the data can
be held in the main memory of a state-of-the-art computer. Thus, the trans-
fer of data between main memory and secondary, disk-based memory, and
not the internal memory computation, is often the bottleneck. A number of
models have been developed for the purpose of analyzing this bottleneck and
designing algorithms that minimize the traffic between main memory and
disk. The algorithms discussed in this chapter are designed and analyzed in
the parallel disk model (PDM) of Vitter and Shriver [755]. For a definition
and discussion of this model, the reader may refer to Chapter 1.
Despite the efforts of many researchers [1, 7, 52, 53, 60, 164, 192, 302, 419,
485, 521, 522, 550, 567, 737], the design of I/O-efficient algorithms for basic
graph problems is still a research area with many challenging open problems.
For most graph problems, Ω(perm(|V |)) or Ω(sort(|V |)) are lower bounds on
the number of I/Os required to solve them [53, 192], while the best known
algorithms for these problems on general graphs perform considerably more
I/Os. For example, the best known algorithms for DFS and SSSP perform
Ω(|V |) I/Os in the worst case; for BFS an algorithm performing o(|V |) I/Os
has been proposed only recently (see Table 5.1). While these algorithms are
I/O-efficient for graphs with at least B · |V | edges, they are inefficient for
sparse graphs.
In this chapter we focus on algorithms that solve a number of funda-
mental graph problems I/O-efficiently on sparse graphs. The algorithms we
discuss, besides exploiting the combinatorial and geometric properties of spe-
cial classes of sparse graphs, demonstrate the power of two general techniques
applied in I/O-efficient graph algorithms: graph contraction and time-forward
processing. The problems we consider are computing the connected and bi-
connected components (CC and BCC), a minimum spanning tree (MST), or
∗ Part of this work was done when the second author was a Ph.D. student at the
School of Computer Science of Carleton University, Ottawa, Canada.
Table 5.1. The best known upper bounds for fundamental graph problems on
undirected graphs. The algorithms are deterministic and use linear space.

Problem   General undirected graphs
CC, MST   O(sort(|E|) · log log(|V|B/|E|))                           [567, 60]
SSSP      O(|V| + (|E|/B) · log(|V|/B))                              [485]
DFS       O(|V| + (|V|/M) · scan(|E|))                               [192]
          O((|V| + scan(|E|)) · log2 |V|)                            [485]
BFS       O(√(|V||E|/B) + sort(|V| + |E|) · log log(|V|B/|E|))       [542]
an ear decomposition of the given graph, breadth-first search (BFS), depth-
first search (DFS), and single source shortest paths (SSSP). The first four
problems can be categorized as connectivity problems, while the latter three
can be considered graph searching problems.
In our discussion we emphasize that graph contraction almost imme-
diately leads to I/O-efficient solutions for connectivity problems on sparse
graphs because edge contractions preserve the connectivity of the graph. For
graph searching problems, the algorithms we discuss exploit structural prop-
erties of the graph classes we consider such as having small separators, out-
erplanar or planar embeddings, or tree-decompositions of constant width.
The graph classes we consider are outerplanar graphs, planar graphs, grid
graphs and graphs of bounded treewidth. A crucial condition for exploiting
the structural properties of these graph classes in algorithms is the ability
to compute such information explicitly. We discuss I/O-efficient algorithms
that compute outerplanar and planar embeddings of outerplanar and planar
graphs, tree-decompositions of graphs of bounded treewidth, and small sep-
arators of graphs in any of these classes. Table 5.2 gives an overview of the
algorithms for sparse graphs discussed in this chapter.
In Section 5.2 we define the graph classes we consider. In Section 5.3
we describe the two fundamental algorithmic techniques used to solve graph
problems I/O-efficiently on sparse graphs. In each of Sections 5.4 through 5.8
we discuss a different graph problem. Sections 5.5 through 5.8 are further
divided into subsections describing the different solutions of the considered
problem for different classes of sparse graphs. While we present these so-
lutions separately, we emphasize the common ideas behind these solutions.
We conclude in Section 5.9 with a summary and a discussion of some open
problems.
Table 5.2. The problems that can be solved in O(sort(N)) I/Os and linear space
on sparse graphs and the sections where they are discussed. A left-arrow indicates
that the problem can be solved using the more general algorithm to the left.

Problem              sparse   planar   grid          outerplanar   bounded treewidth
CC, BCC, MST         5.4      ←        ←             ←             ←
Ear decomposition    5.4      ←        ←             ←             ←
BFS + SSSP           open     5.5.1    5.5.1         5.5.2         5.5.2
DFS                  open     5.6.1    open (5.6.2)  5.6.3         open
Graph partition      N/A      5.7.1    5.7.2         5.7.3         5.7.3
Embedding            N/A      5.8.1    N/A           5.8.3         N/A
Tree-decomposition   N/A      N/A      N/A           5.8.3         5.8.2
5.2 Definitions and Graph Classes
The notation and terminology used in this chapter is quite standard. The
reader may refer to [283, 382] for definitions of basic graph-theoretic concepts.
For clarity, we review a few basic definitions and define the graph classes
considered in this chapter.
Given a graph G = (V, E) and an edge (v, w) ∈ E, the contraction of
edge (v, w) is the operation of replacing vertices v and w with a new vertex x
and every edge (u, y), where u ∈ {v, w} and y ∉ {v, w}, with an edge (x, y).
This may introduce duplicate edges into the edge set of G. These edges are
removed. We call graph G sparse if |E′| = O(|V′|) for any graph H = (V′, E′)
that can be obtained from G through a series of edge contractions.¹
A planar embedding Ĝ of a graph G = (V, E) is a drawing of G in the
plane so that every vertex is represented as a unique point, every edge is
represented as a contiguous curve connecting its two endpoints, and no two
edges intersect, except possibly at their endpoints. A graph G is planar if it
has a planar embedding. Given an embedded planar graph, the faces of G
are the connected components of R2 \ Ĝ. The boundary of a face f is the set
of vertices and edges contained in the closure of f .
A graph G = (V, E) is outerplanar if it has a planar embedding one of
whose faces has all vertices of G on its boundary. We call this face the outer
face of G.
A grid graph is a graph whose vertices are a subset of the vertices of a
√N × √N regular grid. Every vertex is denoted by its coordinates (i, j) and
¹ The authors of [192] call these graphs "sparse under edge contraction", thereby
emphasizing the fact that the condition |E| = O(|V|) is not sufficient for a graph
to belong to this class.
Fig. 5.1. A graph of treewidth three and a tree-decomposition of width three for
the graph.
can be connected only to those vertices whose coordinates differ by at most
one from its own coordinates. Note that a grid graph is not necessarily planar
because diagonal edges may intersect.
A tree-decomposition of a graph G = (V, E) is a pair D = (T, X ), where
T = (I, F ) is a tree and X = (Xi )i∈I is a collection of vertex sets satisfying
the following properties (see Fig. 5.1):
(i) ∪_{i∈I} Xi = V,
(ii) For every edge (v, w) ∈ E, there exists a node i ∈ I so that v, w ∈ Xi ,
and
(iii) For two nodes i, k ∈ T and any node j on the path from i to k in T ,
Xi ∩ Xk ⊆ Xj .
The width of tree-decomposition D is defined as max{|Xi | : i ∈ I} − 1. The
treewidth of graph G is the minimum width of all its tree-decompositions.
Intuitively, the treewidth of a graph measures how close the graph is to be-
ing a tree or forest. A class C of graphs is said to have bounded treewidth if
there exists a constant k so that all graphs in C have treewidth at most k.
Outerplanar graphs, for example, have treewidth two. For algorithmic pur-
poses, a particular type of tree-decomposition is especially useful: A nice
tree-decomposition of a graph G is a tree-decomposition D = (T, X ) with the
following additional properties:
(iv) Tree T is a rooted binary tree, and
(v) Every internal node of T is of one of the following types: A forget node
i ∈ T has one child j, and Xi = Xj \ {x}, for some vertex x ∈ Xj . An
introduce node i ∈ T has one child j, and Xi = Xj ∪{x}, for some vertex
x ∈ Xj . A join node i ∈ T has two children j and k, and Xi = Xj = Xk .
The leaves of T are also referred to as start nodes.
Bodlaender and Kloks [136] show that every graph of treewidth k has a
nice tree-decomposition of width k and size O(N ), where the size of a tree-
decomposition is the number of nodes in T .
The relationships between the different graph classes are visualized in
Fig. 5.2. In general, the more classes a graph belongs to, the more restrictive
is its structure, so that it is not surprising that outerplanar graphs allow very
simple solutions to the problems discussed in this chapter.
Fig. 5.2. The relationships between the different graph classes considered in this
survey.
5.3 Techniques
Before discussing the particular algorithms in this survey, we sketch the two
fundamental algorithmic techniques used in these algorithms: graph contrac-
tion and time-forward processing.
5.3.1 Graph Contraction
At a very abstract level, graph contraction is simple and elegant: Identify a
number of edge-disjoint subgraphs of G so that representing each such sub-
graph by a graph of smaller size reduces the size of G by a constant factor
and preserves the properties of interest. This approach is often taken in par-
allel graph algorithms, where the contraction procedure is applied recursively
q = O(log N ) times, in order to produce a sequence G = G0 , G1 , . . . , Gq of
graphs with |Gq | = O(1). Then the problem is solved in O(1) time on Gq . A
solution for graph G is constructed by undoing the contraction steps and at
each step deriving a solution for Gi from the given solution for graph Gi+1 .
When designing I/O-efficient algorithms, the contraction can usually stop
after q = O(log B) contraction levels. At that point, the resulting graph is
guaranteed to have O(|V |/B) vertices, so that the algorithm can afford to
spend O(1) I/Os per vertex to solve the problem on the contracted graph.
The edges can usually be handled using I/O-efficient data structures.
5.3.2 Time-Forward Processing
Time-forward processing is a very elegant technique for evaluating directed
acyclic graphs. This technique has been proposed in [192] and improved
in [52]. Formally, the following problem can be solved using this technique:
Let G be a directed acyclic graph (DAG) whose vertices are numbered
so that every edge in G leads from a vertex with lower number to a ver-
tex with higher number. That is, this numbering is a topological number-
ing of G. Let every vertex v of G store a label φ(v), and let f be a func-
tion to be applied in order to compute for every vertex v, a new label
ψ(v) = f (φ(v), λ(u1 ), . . . , λ(uk )), where u1 , . . . , uk are the in-neighbors of v
in G, and λ(ui ) is some piece of information “sent” from ui to v after com-
puting ψ(ui ). The goal is to “evaluate” G, i.e., to compute ψ(v), for all
vertices v ∈ G.
While time-forward processing does not solve the problem of comput-
ing ψ(v) I/O-efficiently in the case where the input data φ(v) and
λ(u1 ), . . . , λ(uk ) do not fit into internal memory, it provides an elegant way
to supply vertex v with this information at the time when ψ(v) is computed.
The idea is to process the vertices in G by increasing numbers. This guaran-
tees that all in-neighbors of vertex v are evaluated before v. Thus, if these
in-neighbors “send” their outputs λ(u1 ), . . . , λ(uk ) to v, v has these inputs
and its own label φ(v) at its disposal to compute ψ(v). After computing ψ(v),
v sends its output λ(v) “forward in time” to its out-neighbors, which guar-
antees that these out-neighbors have λ(v) at their disposal when it is their
turn to be evaluated.
The implementation of this technique due to Arge [52] uses a priority
queue Q to realize the “sending” of information along the edges of G. When
a vertex v wants to send its output λ(v) to another vertex w, it inserts
λ(v) into priority queue Q and gives it priority w. When vertex w is being
evaluated, it removes all entries with priority w from Q. As every in-neighbor
of w sends its output to w by queuing it with priority w, this provides w
with the required inputs. Moreover, every vertex removes its inputs from the
priority queue when it is evaluated, and all vertices with smaller numbers are
evaluated before w. Thus, the entries in Q with priority w are in fact those
with lowest priority in Q at the time when w is evaluated. Therefore they can
be removed using a sequence of DeleteMin operations. Since this procedure
involves O(|V | + |E|) priority queue operations, graph G can be evaluated in
O(sort(|V | + |E|)) I/Os using an I/O-efficient priority queue (see Chapter 2).
The power of time-forward processing lies in the fact that many prob-
lems on undirected graphs can be expressed as evaluation problems of DAGs
derived from these graphs.
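
A minimal sketch of the technique, with heapq standing in for the I/O-efficient
priority queue and with the vertices assumed to be integers numbered
topologically, might look as follows (all identifiers are ours):

import heapq
from itertools import count

def evaluate_dag(topo_order, out_nbrs, phi, f):
    """Time-forward processing sketch.

    topo_order lists the vertices in topological order; phi maps each
    vertex to its input label; f(phi_v, inputs) computes psi(v) from
    phi(v) and the messages lambda(u) = psi(u) sent by the in-neighbors.
    """
    tie = count()              # tie-breaker so payloads are never compared
    Q = []                     # entries (receiver, tie, payload)
    psi = {}
    for v in topo_order:
        inputs = []
        # v's pending messages are exactly the current minima of Q, so a
        # run of delete-min operations retrieves them
        while Q and Q[0][0] == v:
            inputs.append(heapq.heappop(Q)[2])
        psi[v] = f(phi[v], inputs)
        for w in out_nbrs[v]:  # send psi(v) "forward in time" to w
            heapq.heappush(Q, (w, next(tie), psi[v]))
    return psi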
5.4 Connectivity Problems
5.4.1 Connected Components
The I/O-efficient connectivity algorithm of [192] uses ideas from the PRAM
algorithm of Chin et al. [197] for this problem. First the graph contraction
technique from Section 5.3.1 is applied in order to compute a sequence G =
G0 , . . . , Gq of graphs whose vertex sets have geometrically decreasing sizes
and so that the vertex set of graph Gq fits into main memory. The latter
implies that the connected components of Gq can be computed using a simple
semi-external connectivity algorithm as outlined below. Given the connected
components of Gq , the connected components of G are computed by undoing
the contraction steps used to construct graphs G1 , . . . , Gq one by one and in
each step computing the connected components of Gi from those of Gi+1 .
The details of the algorithm are as follows:
In order to compute graph Gi+1 from graph Gi during the contraction
phase, every vertex in Gi = (Vi , Ei ) selects its incident edge leading to its
neighbor with smallest number. The selected edges form a forest Fi each of
whose trees contains at least two vertices. Every tree in Fi is then contracted
into a single vertex, which produces a new graph Gi+1 = (Vi+1 , Ei+1 ) with at
most half as many vertices as Gi. In particular, for 0 ≤ i ≤ q, |Vi| ≤ |V|/2^i.
Choosing q = log(|V |/M ), this implies that |Vq | ≤ M , i.e., the vertex set
of Gq fits into main memory. Hence, the connected components of Gq can be
computed using the following simple algorithm:
Load the vertices of Gq into main memory and label each of them as
being in a separate connected component. Now scan the edge set of Gq and
merge connected components whenever the endpoints of an edge are found
to be in different connected components. The computation of this algorithm
is carried out in main memory, so that computing the connected components
of Gq takes O(scan(|Vq | + |Eq |)) I/Os.
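
One natural in-memory realization of this merging step is a union-find
structure over the vertex labels; the following sketch (Python; names are ours)
scans the edges exactly once:

def semi_external_cc(num_vertices, edge_stream):
    """Semi-external labeling step: the vertex labels fit in memory,
    the edges are only streamed.  Union-find realizes the component
    merging described above.
    """
    label = list(range(num_vertices))      # one component per vertex

    def find(x):                           # path-halving find
        while label[x] != x:
            label[x] = label[label[x]]
            x = label[x]
        return x

    for u, v in edge_stream:               # one scan over the edge set
        ru, rv = find(u), find(v)
        if ru != rv:
            label[max(ru, rv)] = min(ru, rv)   # merge the two components
    return [find(v) for v in range(num_vertices)]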
To construct the connected components of graph Gi from those of
graph Gi+1 when undoing the contraction steps, all that is required is to
replace each vertex v of Gi with the tree in Fi it represents and assign v’s
component label to all vertices in this tree.
In [192] it is shown that the construction of Gi+1 from Gi as well
as computing the connected components of Gi from those of Gi+1 takes
O(sort(|Ei|)) I/Os. Hence, the whole connectivity algorithm takes
Σ_{i=0}^{log(|V|/M)} O(sort(|Ei|)) I/Os. Since the graphs we consider are sparse,
|Ei| = O(|Vi|) = O(|V|/2^i), so that Σ_{i=0}^{log(|V|/M)} O(sort(|Ei|)) = O(sort(|V|)).
That is, the contraction-based connectivity algorithm computes the connected
components of a sparse graph in O(sort(|V|)) I/Os.
5.4.2 Minimum Spanning Tree
The algorithm outlined in Section 5.4.1 can be modified so that it computes
an MST (a minimum spanning forest if the graph is disconnected). Instead of
selecting for every vertex, the edge connecting it to its neighbor with smallest
number, the MST-algorithm chooses the edge of minimum weight incident
to each vertex. The weight of an edge in Gi is the minimum weight of its
corresponding edges in G. When an edge in Gi is selected, its corresponding
minimum weight edge in G is added to the MST. For details and a correct-
ness proof see [192, 197]. Clearly, these modifications do not increase the
I/O-complexity of the connectivity algorithm. Hence, the algorithm takes
O(sort(|V |)) I/Os on sparse graphs.
5.4.3 Biconnected Components
Tarjan and Vishkin [714] present an elegant parallel algorithm to compute the
biconnected components of a graph G by computing the connected compo-
nents of an auxiliary graph H. Given a spanning tree T of G, every non-tree
edge (v, w) of G (i.e., (v, w) ∈ E(G)\E(T )) defines a fundamental cycle, which
consists of the path from v to w in T and edge (v, w) itself. The auxiliary
graph H contains one vertex per edge of G. Two vertices in H are adjacent if
the corresponding edges appear consecutively on a fundamental cycle in G.
Using this definition of H, it is easy to verify that two edges of G are in the
same biconnected component of G if and only if the two corresponding vertices
in H are in the same connected component of H. In [192], Chiang et al. show that
the construction of H from G can be carried out in O(sort(|V | + |E|)) I/Os.
Since H has O(|E|) vertices and edges, the connected components of H can
be computed in O(sort(|E|)) I/Os using the connectivity algorithm from
Section 5.4.1. Hence, the biconnected components of G can be computed
in O(sort(|V |)) I/Os if G is sparse.
5.4.4 Ear Decomposition
An ear decomposition E = (P0 , P1 , . . . , Pk ) of a graph G is an ordered parti-
tion of G into edge-disjoint simple paths Pi with endpoints si and ti . Ear P0
is an edge. For 1 ≤ i ≤ k, ear Pi shares its two endpoints si and ti , but none
of its internal vertices, with the union P0 ∪ · · · ∪ Pi−1 of all previous ears. An
ear Pi is open if si = ti . Ear decomposition E is open if all its ears are open.
For a graph to have an ear decomposition, it has to be two-edge connected.
That is, there cannot be an edge whose removal disconnects the graph. For
the graph to have an open ear decomposition, it has to be 2-vertex connected
(biconnected). That is, there cannot be a vertex whose removal disconnects
the graph. So let G be a 2-edge connected graph, and let E be an ear decom-
position of G. Removing an arbitrary edge from each ear in E results in a
spanning tree of G. Conversely, an ear decomposition of G can be obtained
from any spanning tree T of G as follows: Consider the fundamental cycles
defined by the non-tree edges of G and sort them by their distances in T
from the root r of T , where the distance of a cycle from r is the minimum
distance of its vertices from r. Remove from each fundamental cycle all those
edges that are already contained in a previous cycle. It can be shown that
the resulting sorted set of cycles and paths is an ear decomposition of G.
In order to obtain a parallel ear decomposition algorithm, Maon et al. [530]
propose the following implementation of the above idea: Their algorithm
assigns labels to the edges of G so that two edges are in the same ear if and
only if they have the same label. For a non-tree edge e = (v, w), let LCA(e) be
the lowest common ancestor of v and w in T , and let depth(e) be the distance
of LCA(e) from the root of T . Then label(e) = (depth(e), e). For a tree-edge e,
label(e) is the minimum label over all non-tree edges e′ such that e lies on the
fundamental cycle defined by e′. Now every non-tree edge e defines
a cycle or simple path Pe with edge set {e′ ∈ G : label(e′) = label(e)}. Maon
et al. show that the collection of these cycles and paths, sorted by their labels,
is an ear decomposition of G. This ear decomposition is not necessarily open,
even if G is biconnected; but the computation can be modified to produce an
open ear decomposition in this case. See [530] for details.
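
The following naive sketch computes these labels in memory; simple walks to the
LCA stand in for the I/O-efficient tree computations of [192] discussed below,
and the identifiers are ours:

def ear_labels(parent, depth, nontree_edges):
    """Sketch of the labeling of Maon et al. [530].

    parent/depth describe the rooted spanning tree T.  Every non-tree
    edge e = (v, w) gets label (depth(LCA(e)), e); every tree edge gets
    the minimum label among the non-tree edges whose fundamental cycle
    contains it.
    """
    INF_LABEL = (float("inf"), None)
    tree_label = {}               # tree edge, keyed by its child endpoint

    for (v, w) in nontree_edges:
        a, b, path = v, w, []
        while a != b:             # climb to the LCA, recording tree edges
            if depth[a] < depth[b]:
                a, b = b, a
            path.append(a)
            a = parent[a]
        lab = (depth[a], (v, w))  # a is now LCA(v, w)
        for x in path:            # relax the labels of the cycle's tree edges
            tree_label[x] = min(tree_label.get(x, INF_LABEL), lab)
    return tree_label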
Given spanning tree T , the computation of edge labels as described above
involves only standard tree computations such as answering LCA queries
or computing the depth of a vertex in T . In [192] it is shown that these
computations can be carried out in O(sort(|V |)) I/Os. Since we consider
sparse graphs, tree T can be computed in O(sort(|V |)) I/Os using for example
the minimum spanning tree algorithm of Section 5.4.2. Hence, for sparse
graphs an (open) ear decomposition can be computed in O(sort(|V |)) I/Os.
5.5 Breadth-First Search and Single Source Shortest
Paths
After covering connectivity problems, we now turn to the first two graph
searching problems: breadth-first search (BFS) and the single source shortest
path (SSSP) problem. Since BFS is the same as the SSSP problem if all edges
in the graph have unit weight, and both problems have an Ω(perm(|V |)) I/O
lower bound, we restrict our attention to SSSP-algorithms. Even though the
details of the SSSP-algorithms for different graph classes differ, their efficiency
is based on the fact that the considered graph classes have small separators.
In particular, a separator decomposition of a graph in each such class can be
Fig. 5.3. (a) A partition of a planar graph into the shaded subgraphs using the
black separator vertices. (b) The boundary sets of the partition.
obtained I/O-efficiently (see Section 5.7), and the shortest path algorithms
apply dynamic programming to such a decomposition in order to solve the
SSSP problem.
5.5.1 Planar Graphs and Grid Graphs
Every planar graph with N vertices or grid graph embedded into a √N × √N
grid contains a set S of O(N/B) separator vertices whose removal partitions
the graph into O(N/B²) subgraphs of size at most B² and boundary size at
most B, where the boundary ∂Gi of a subgraph Gi is the set of separator
vertices adjacent to vertices in Gi [315]. The set of separator vertices can be
partitioned into maximal subsets so that the vertices in each subset are adja-
cent to the same set of subgraphs Gi . These sets are called the boundary sets
of the partition (see Fig. 5.3). If the graph has bounded degree, which is true
for grid graphs and can be ensured for planar graphs using a simple
transformation, there exists a partition that, in addition to the above
properties, has only O(N/B²) boundary sets [315].
Given such a partition of G into subgraphs G1 , . . . , Gq , the single source
shortest path algorithm first computes a graph GR with vertex set S so that
the distances between two separator vertices in G and GR are the same.
Assuming that s ∈ S, the distances from s to all separator vertices can hence
be computed by solving the single source shortest path problem on GR , which
can be done I/O-efficiently due to the reduced size of the vertex set of GR .
Given the distances from s to all separator vertices, the distances from s to
all vertices in a graph Gi can be computed as distG (s, v) = min{distG (s, u) +
distRi (u, v) : u ∈ ∂Gi }, where Ri is the subgraph of G induced by the vertices
in V (Gi ) ∪ ∂Gi .
The construction of graph GR from graph G is done as follows: For each
graph Ri as defined above, compute the distances in Ri between all pairs of
vertices in ∂Gi. Then construct a complete graph R′i with vertex set ∂Gi and
assign weight distRi (v, w) to every edge (v, w) ∈ R′i. Graph GR is the union
of graphs R′1, . . . , R′q.
Assuming that M = Ω(B²), there is enough room in main memory to
store one graph Ri and its compressed version R′i. Hence, graph GR can be
computed from graph G by loading graphs R1, . . . , Rq into main memory, one
at a time, computing for each graph Ri the compressed version R′i without
incurring any I/Os, and writing R′i to disk. As this procedure requires a single
scan of the list of graphs R1 , . . . , Rq , and these graphs have a total size
of O(N ), graph GR can be constructed in O(scan(N )) I/Os. Similarly, once
the distances from s to all separator vertices are known, the computation
of the distances from s to all non-separator vertices can be carried out in
another scan of the list of graphs R1 , . . . , Rq because the computation for the
vertices in Ri is local to Ri .
From the above discussion it follows that the SSSP problem can be solved
in O(sort(N )) I/Os on G if it can be solved in that many I/Os on GR .
Since GR has only O(N/B) vertices and O((N/B²) · B²) = O(N) edges, the
SSSP problem on GR can be solved in O((N/B) log2 (N/B)) I/Os using the
shortest path algorithm described in Chapter 4. In order to reduce the I/O-
complexity of this step to O(sort(N )), Arge et al. [60, 68] propose a modified
version of Dijkstra’s algorithm, which avoids the use of a DecreaseKey
operation. This is necessary because the best known external priority queue
that supports this operation [485] takes O((N/B) log2 (N/B)) I/Os to process
a sequence of N priority queue operations, while there are priority queues
that do not support this operation, but can process a sequence of N Insert,
Delete, and DeleteMin operations in O(sort(N )) I/Os [52, 157].
In addition to a priority queue Q storing the unvisited vertices of GR ,
the algorithm of Arge et al. maintains a list L of the vertices of GR , each
labelled with its tentative distance from s. That is, for every vertex stored
in Q, its label in L is the same as its priority in Q. For a vertex not in Q, list L
stores its final distance from s. Initially, all distances, except that of s, are ∞.
Vertex s has distance 0. Now the algorithm repeatedly performs DeleteMin
operations on Q to obtain the next vertex to process. For every retrieved
vertex v, the algorithm loads the adjacency list of v into main memory and
updates the distances from s to v’s neighbors as necessary. (The adjacency
list of v fits into main memory because every vertex in S has degree O(B)
in GR . To see this, observe that each vertex in S is on the boundary of O(1)
subgraphs Gi because graph G has bounded degree, and each subgraph has at
most B boundary vertices.) In order to update these distances, the algorithm
retrieves the entries corresponding to v’s neighbors from L and compares the
current tentative distance of each neighbor w of v to the length of the path
from s to w through v. If the path through v is shorter, the distance from s
to w is updated in L and Q. Since the old tentative distance from s to w
is known, the update on Q can be performed by deleting the old copy of w
and inserting a new copy with the updated distance as priority. That is, the
required DecreaseKey operation is replaced by a Delete and an Insert
operation.
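
In a sequential in-memory sketch the same device looks as follows. Since heapq
cannot delete arbitrary entries, the Delete is realized lazily here: stale
copies are simply skipped when they are extracted (names are ours):

import heapq

def sssp_no_decrease_key(adj, s):
    """Dijkstra with DecreaseKey replaced by Delete + Insert, in the
    spirit of the modification of Arge et al. sketched above; heapq
    plays the role of the external priority queue and a dict plays the
    role of list L.  adj[v]: list of (neighbor, weight) pairs.
    """
    dist = {s: 0}                 # list L: current tentative distances
    Q = [(0, s)]
    settled = set()
    while Q:
        d, v = heapq.heappop(Q)
        if v in settled or d > dist.get(v, float("inf")):
            continue              # stale copy: its deletion was deferred
        settled.add(v)
        for w, c in adj[v]:       # scan v's adjacency list
            if d + c < dist.get(w, float("inf")):
                dist[w] = d + c   # update L ...
                heapq.heappush(Q, (d + c, w))  # ... and insert a new copy
    return dist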
Since graph GR has O(N/B) vertices and O(N ) edges, retrieving all ad-
jacency lists takes O(N/B + scan(N )) = O(scan(N )) I/Os. For the same
reason, the algorithm performs only O(N ) priority operations on Q, which
takes O(sort(N )) I/Os. It remains to analyze the number of I/Os spent on
accessing list L. If the vertices in L are not arranged carefully, the algo-
rithm may spend one I/O per access to a vertex in L, O(N ) I/Os in total.
In order to reduce this I/O-bound to O(N/B), Arge et al. use the fact that
there are only O(N/B²) boundary sets, each of size O(B). If the vertices
in each boundary set are stored consecutively in L, the bound on the size
of each boundary set implies that the vertices in the set can be accessed in
O(1) I/Os. Moreover, every boundary set is accessed only O(B) times, once
per vertex on the boundaries of the subgraphs defining this boundary set.
Since there are O(N/B²) boundary sets, the total number of I/Os spent on
loading boundary sets from L is hence O(B · N/B²) = O(N/B).
The algorithm described above computes only the distances from s to all
vertices in G. However, it is easy to augment the algorithm so that it computes
shortest paths in O(sort(N )) I/Os using an additional post-processing step.
5.5.2 Graphs of Bounded Treewidth and Outerplanar Graphs
The SSSP algorithm for planar graphs and grid graphs computes shortest
paths in three steps: First it encodes the distances between separator ver-
tices in a compressed graph. Then it computes the distances from the source
to all separator vertices in this compressed graph. And finally it computes
the distances from the source to all non-separator vertices using the dis-
tance information computed for the separator vertices on the boundary of
the subgraph Gi containing each such vertex. The shortest path algorithm
for outerplanar graphs and graphs of bounded treewidth [522, 775] applies
this approach iteratively, using the fact that a tree-decomposition of the
graph provides a hierarchical decomposition of the graph using separators of
constant size.
Assume that the given tree-decomposition D = (T, X ) of G is nice in the
sense defined in Section 5.2 and that s ∈ Xv, for all v ∈ T.² Then every
subtree of T rooted at some node v ∈ T represents a subgraph G(v) of G,
which shares only the vertices in Xv with the rest of G.
The first phase of the algorithm processes T from the leaves towards the
root and computes for every node v ∈ T and every pair of vertices x, y ∈ Xv ,
the distance from x to y in G(v). Since G(r) = G, for the root r of T , this
produces the distances in G between all vertices in Xr . In particular, the
² Explicitly adding s to all sets Xv to ensure the latter assumption increases the
width of the decomposition by at most one.
distances from s to all other vertices in Xr are known at the end of the first
phase. The second phase processes tree T from the root towards the leaves
to compute for every node v ∈ T , the distances from s to all vertices in Xv .
During the first phase, the computation at a node v uses only the weights
of the edges between vertices in Xv and distance information computed for
the vertices stored at v’s children. During the second phase, the computation
at node v uses the distance information computed for the vertices in Xv
during the first phase of the algorithm and the distances from s to all vertices
in Xp(v) , where p(v) denotes v’s parent in T . Since the computation at every
node involves only a constant amount of information, it can be carried out
in main memory. All that is required is passing distance information from
children to parents in the first phase of the algorithm and from parents to
children in the second phase. This can be done in O(sort(N )) I/Os using
time-forward processing because tree T has size O(N ), and O(1) information
is sent along every edge.
To provide at least some insight into the computation carried out at the
nodes of T , we discuss the first phase of the algorithm. For a leaf v, G(v) is
the graph induced by the vertices in Xv . In particular, |G(v)| = O(1), and
the distances in G(v) between all vertices in Xv can be computed in main
memory. For a forget node v with child w, G(v) = G(w) and Xv ⊂ Xw ,
so that the distance information for the vertices in Xv has already been
computed at node w and can easily be copied to node v. For an introduce
node v with child w, Xv = Xw ∪ {x}. A shortest path in G(v) between two
vertices in Xv consists of shortest paths in G(w) between vertices in Xw and
edges between x and vertices in Xw . Hence, the distances between vertices
in Xv are the same in G(v) and in a complete graph G′(v) with vertex set Xv
whose edges have the following weights: If y, z ∈ Xw , then edge (y, z) has
weight distG(w) (y, z). Otherwise assume w.l.o.g. that y = x. Then the weight
of edge (x, z) is the same in G′(v) as in G. The distances in G(v) between
all vertices in Xv can now be computed by solving all pairs shortest paths
on G′(v). This can be done in main memory because |G′(v)| = O(1). For a
join node u with children v and w, a similar graph G′(u) of constant size
is computed, which captures the lengths of the shortest paths between all
vertices in Xu = Xv = Xw that stay either completely in G(v) or completely
in G(w). The distances in G(u) between all vertices in Xu are again computed
in main memory by solving all pairs shortest paths on G′(u).
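To make the per-node computations concrete, the following sketch implements the first phase as an in-memory post-order recursion; the actual algorithm realizes the same information flow with time-forward processing. The input conventions (dictionaries kind, X, and children describing the nice tree-decomposition, and a function edge_weight(x, y) returning the weight of edge (x, y) in G, or infinity if the edge is absent) are hypothetical stand-ins for whatever representation is at hand.

    from itertools import product

    INF = float('inf')

    def all_pairs(bag, weight):
        # Floyd-Warshall on the constant-size graph (bag, weight).
        d = {(x, y): (0 if x == y else weight.get((x, y), INF))
             for x in bag for y in bag}
        for k, x, y in product(bag, bag, bag):
            d[(x, y)] = min(d[(x, y)], d[(x, k)] + d[(k, y)])
        return d

    def first_phase(v, kind, X, children, edge_weight):
        # Returns dist_{G(v)}(x, y) for all pairs x, y in the bag X[v].
        bag = X[v]
        if kind[v] == 'leaf':
            w = {(x, y): edge_weight(x, y) for x in bag for y in bag if x != y}
            return all_pairs(bag, w)
        if kind[v] == 'forget':            # X[v] is a subset of the child's bag
            d = first_phase(children[v][0], kind, X, children, edge_weight)
            return {(x, y): d[(x, y)] for x in bag for y in bag}
        if kind[v] == 'introduce':         # X[v] = X[w] plus one new vertex
            w_node = children[v][0]
            d = first_phase(w_node, kind, X, children, edge_weight)
            x_new = next(u for u in bag if u not in X[w_node])
            w = {(y, z): d[(y, z)]
                 for y in X[w_node] for z in X[w_node] if y != z}
            for y in X[w_node]:            # edges of G between x_new and X[w]
                w[(x_new, y)] = edge_weight(x_new, y)
                w[(y, x_new)] = edge_weight(y, x_new)
            return all_pairs(bag, w)
        # Join node: combine shortest paths that stay in either subgraph.
        d1 = first_phase(children[v][0], kind, X, children, edge_weight)
        d2 = first_phase(children[v][1], kind, X, children, edge_weight)
        w = {(x, y): min(d1[(x, y)], d2[(x, y)])
             for x in bag for y in bag if x != y}
        return all_pairs(bag, w)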
The second phase of the algorithm proceeds in a similar fashion, using the
fact that a shortest path from s to a vertex x in Xv either stays completely
inside G(v), in which case the shortest path information between s and x
computed in the first phase is correct, or it consists of a shortest path from s
to a vertex y in Xp(v) followed by a shortest path from y to x in G(p(v)).
Since outerplanar graphs have treewidth 2, the algorithm sketched above
can be used to solve SSSP on outerplanar graphs in O(sort(N )) I/Os. Alter-
natively, one can derive a separator decomposition of an outerplanar graph
Fig. 5.4. (a) A planar graph G with its faces colored according to their levels.
Level-0 faces are white. Level-1 faces are light grey. Level-2 faces are dark grey.
(b) The corresponding partition of the graph into outerplanar subgraphs H0 (solid),
H1 (dotted), and H2 (dashed).
directly from an outerplanar embedding of the graph. This separator decom-
position has a somewhat simpler structure, which allows the algorithm to be
simplified and the constants to be improved. However, the overall structure
of the algorithm remains the same. The interested reader may refer to [775]
for details.
5.6 Depth-First Search
The algorithms of the previous section use small separators to compute short-
est paths I/O-efficiently. In this section we discuss algorithms that construct
DFS-trees of planar graphs, grid graphs, and outerplanar graphs. These al-
gorithms exploit the geometric structure of these graphs to carry out their
task in an I/O-efficient manner. Since graphs of bounded treewidth do not
exhibit such a geometric structure in general, the techniques used in these
algorithms fail on graphs of bounded treewidth, so that the problem of com-
puting a DFS-tree of a graph of bounded treewidth I/O-efficiently is open.
5.6.1 Planar Graphs
For the sake of simplicity, assume that the given planar graph G is bi-
connected. If this is not the case, a DFS-tree of G can be obtained in
O(sort(N )) I/Os by identifying the biconnected components of G using the
biconnectivity algorithm from Section 5.4.3 and merging appropriate DFS-
trees computed separately for each of these biconnected components.
In order to perform DFS in an embedded biconnected planar graph G, the
algorithm of [62], which follows ideas from [372], uses the following approach:
First the faces of G are partitioned into layers around a central face that
has the source of the DFS on its boundary (see Fig. 5.4a). The partition of
the faces of G into layers induces a partition of G into outerplanar graphs
of a particularly simple structure, so that DFS-trees of these graphs can be
Fig. 5.5. (a) The face-on-vertex graph GF shown in bold. (b) Spanning tree T1 and
layer graph H2 are shown in bold. Attachment edges (ui , vi ) are thin solid edges.
The vertices in T1 are labelled with their DFS-depths. (c) The final DFS tree of G.
computed I/O-efficiently (see Fig. 5.4b). A DFS-tree of G is then obtained
by merging appropriate DFS-trees of these layer graphs.
Formally, the layers are defined as follows: The first layer consists of a
single face r that has the source s of the DFS on its boundary. Given the first
i layers, the (i + 1)-st layer contains all faces that share a vertex with a face
in the i-th layer and are not contained in layers 0 through i. Such a partition
of the faces of G into layers can be obtained using BFS in the face-on-vertex
graph GF of G. This graph contains all vertices of G as well as one vertex f ∗
per face f of G. There is an edge (v, f ∗ ) in GF if and only if vertex v is on
the boundary of face f in G. Given the BFS-distances of all vertices in GF
from r∗ , a face f is in the i-th layer if the BFS-distance of vertex f ∗ from r∗
is 2i (see Fig. 5.5a).
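As a small illustration, the translation from BFS-distances in GF to face levels is a single scan. In the sketch below, the face vertex f∗ is represented by the face itself, and bfs_dist is assumed to map it to its BFS-distance from r∗; both conventions are hypothetical.

    def face_on_vertex_edges(faces, boundary):
        # One edge (v, f*) of G_F per vertex v on the boundary of face f;
        # `boundary` maps each face to the list of vertices on its boundary.
        return [(v, f) for f in faces for v in boundary[f]]

    def face_levels(faces, bfs_dist):
        # Face f lies in layer i iff its vertex f* has BFS-distance 2i from r*.
        return {f: bfs_dist[f] // 2 for f in faces}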
The computed layer partition of the faces of G defines the following par-
tition of G into outerplanar subgraphs H0 , H1 , . . . , Hk (see Fig. 5.4b). Let
Gi be the subgraph of G induced by all faces at levels 0 through i, and let
Ei = E(Gi ) \ E(Gi−1 ). Then edge set Ei can be partitioned into two sets E′i
and E″i. An edge is in E′i if none of its endpoints is on the boundary of a
level-(i − 1) face. Otherwise it is in E″i. Now graph Hi is defined as the sub-
graph of G induced by the edges in E′i. The edges in E″i are the attachment
edges of Hi , since they form the connection between Hi and Gi−1 .
In order to compute a DFS-tree of G, the algorithm now computes DFS-
trees T0 , . . . , Tk of graphs H0 = G0 , . . . , Gk = G, where for all 1 ≤ i ≤ k,
tree Ti is obtained by augmenting tree Ti−1 appropriately. To facilitate this
incremental construction, the algorithm maintains the invariant that the ver-
tices on each boundary cycle of Gi appear on a root-to-leaf path in Ti . We
call this the boundary cycle invariant. Using this invariant, a DFS-tree Ti+1
of Gi+1 can be obtained as follows: For every connected component H′
of Hi+1 , find the attachment edge in E″i+1 that connects a vertex v in H′
to a vertex u of maximal depth in Ti . Compute a DFS-tree of H′ rooted
at v and join it to Ti using edge (u, v). Since H′ is enclosed by exactly one
boundary cycle of Gi , the choice of edge (u, v) and the boundary cycle invari-
ant guarantee that for any other attachment edge (x, y) of H′, its endpoint
x ∈ Gi is an ancestor of u in Ti . The endpoint y ∈ H′ is a descendant of v,
so that (x, y) is a back edge. Hence, Ti+1 is a DFS-tree of Gi+1 .
We have to discuss how to compute a DFS-tree of H′ in a manner that
maintains the boundary cycle invariant. This is where the simple structure
of Hi+1 comes into play. In particular, it can be shown that graph Hi+1 is a
“forest of cycles”. That is, every non-trivial biconnected component of Hi+1
is a simple cycle. Now consider the biconnected components of H′. These
components form a tree of biconnected components, which can be rooted
at a component containing vertex v. For every biconnected component B,
except the root component, its parent cutpoint is the cutpoint shared by B
and its parent component in the tree of biconnected components. The parent
cutpoint of the root component is defined to be v, even though v is not
necessarily a cutpoint of H′. A DFS-tree of H′ is now obtained by removing
from every non-trivial biconnected component of H′ one of the two edges
incident to its parent cutpoint. Since the boundary cycles of Gi+1 are cycles
in Hi+1 , and each such cycle is a non-trivial biconnected component of Hi+1 ,
this maintains the boundary cycle invariant for Ti+1 .
The computation of a DFS-tree for H′ involves only computing the bicon-
nected components of H′ and the removal of appropriate edges from the non-
trivial biconnected components. The former can be done in O(sort(|H′|)) I/Os
using the biconnectivity algorithm from Section 5.4.3. The latter can be done
in a constant number of sorting and scanning steps. Hence, computing DFS-
trees for all connected components of Hi+1 takes O(sort(|Hi+1 |)) I/Os. Find-
ing attachment edges (u, v) for all connected components of Hi+1 requires
sorting and scanning the vertex set of Hi and the set E″i+1 of attachment
edges, which takes O(sort(|Hi | + |Hi+1 |)) I/Os. Summing these complexi-
ties over all layers of G, the whole DFS-algorithm takes O(sort(|V |)) I/Os
because graphs H0 , . . . , Hk are disjoint.
5.6.2 Grid Graphs
Finding an I/O-efficient algorithm for DFS in grid graphs is still an open
problem, even though the standard internal memory DFS-algorithm performs
only O(N/√B) I/Os if carried out carefully. In particular, Meyer [550] made
the following observation: Consider a grid of size √N by √N divided into
subgrids of size √B by √B. Then a DFS-tree of G can be computed using the
standard internal memory algorithm: Start from the block (subgrid) contain-
ing the source vertex, perform DFS until the algorithm visits a vertex that
is not in the block, load the block containing this vertex into main memory,
and continue in this fashion until the DFS-tree is complete. A block may be
loaded several times during a run of the algorithm, each time to compute a
different part of the DFS tree that lies within this block. However, the DFS
tree enters a block through a boundary vertex, and leaves it through a bound-
ary vertex. Every vertex is visited O(1) times by the DFS-algorithm. Since a
block has O(√B) boundary vertices, this implies that every block is loaded
O(√B) times, so that the algorithm takes O(N/B · √B) = O(N/√B) I/Os.
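The following sketch models this blocked DFS for a grid with n = √N rows and columns and blocks of side b = √B. The predicate has_edge is a hypothetical stand-in saying whether two neighboring grid positions are actually connected; only the block-loading pattern is made explicit, and the in-memory work inside a block is not modeled.

    def grid_neighbors(v, n):
        # Up to eight neighboring grid positions of v = (i, j).
        i, j = v
        cand = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1),
                (i - 1, j - 1), (i - 1, j + 1), (i + 1, j - 1), (i + 1, j + 1)]
        return [(x, y) for (x, y) in cand if 0 <= x < n and 0 <= y < n]

    def blocked_grid_dfs(n, b, source, has_edge):
        block = lambda v: (v[0] // b, v[1] // b)
        parent, stack = {source: None}, [source]
        resident, block_loads = block(source), 1
        while stack:
            u = stack[-1]
            if block(u) != resident:      # u lies outside the resident block:
                resident = block(u)       # fetch its block from disk
                block_loads += 1
            for v in grid_neighbors(u, n):
                if v not in parent and has_edge(u, v):
                    parent[v] = u         # tree edge of the DFS-tree
                    stack.append(v)
                    break
            else:
                stack.pop()               # u is fully explored
        return parent, block_loads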
5.6.3 Outerplanar Graphs
The DFS-algorithm for outerplanar graphs is hardly worth being called an
algorithm. It merely exploits the fact that a DFS-tree of the graph is already
encoded in an outerplanar embedding of the graph. In particular, if the graph
is biconnected, the path obtained by removing an edge from the boundary
cycle of the outer face of G is a DFS-tree of G. This path can be extracted
in O(scan(N )) I/Os from an appropriate representation of the outerplanar
embedding.
If the graph is not biconnected, the basic idea of the algorithm is to
walk along the outer boundary of G and keep track of the number of times
a vertex has been visited so far. If a vertex is visited for the first time,
its predecessor along the outer boundary is made its parent in the DFS-
tree. Otherwise nothing is done for the vertex. This strategy can easily be
realized in O(scan(N )) I/Os using a stack, again assuming an appropriate
representation of the outerplanar embedding. The interested reader may refer
to [775] for details.
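A minimal in-memory rendition of this walk is shown below, assuming the closed walk along the outer boundary is given as a vertex list starting at the desired root; the stack-based O(scan(N)) realization over an external representation is omitted.

    def outerplanar_dfs(boundary_walk):
        # First visit of a vertex: its predecessor on the walk becomes its
        # parent.  Later visits of the same (cut) vertex change nothing.
        root = boundary_walk[0]
        parent, prev = {root: None}, root
        for v in boundary_walk[1:]:
            if v not in parent:
                parent[v] = prev
            prev = v
        return parent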
5.7 Graph Partitions
In this section we review algorithms for partitioning sparse graphs into
smaller subgraphs. In particular, given a graph G belonging to a class C
of sparse graphs and an integer h > 0, the goal is to compute a small
set S of vertices whose removal partitions G into k = O(N/h) subgraphs
G1 , . . . , Gk of size at most h. We refer to the pair P = (S, {G1 , . . . , Gk }) as an
h-partition of G. The vertices in S are referred to as separator vertices. Set S
as a whole is referred to as the separator that induces partition P. Finally,
let σ(C, N, h) = max_G min_S {|S| : separator S induces an h-partition of G},
where the maximum is taken over all N -vertex graphs in class C. We call par-
tition P optimal if |S| = O(σ(C, |G|, h)). Among the algorithms presented so
far, only the shortest path algorithm for planar graphs requires an optimal h-
partition for h = B², while the algorithm for outerplanar graphs and graphs
of bounded treewidth only relies on the fact that a recursive partition of the
graph using separators of constant size is encoded in the tree-decomposition
of the graph. Nevertheless graph partitions have been applied to design ef-
ficient sequential and parallel algorithms for a wide range of problems and
may prove useful for designing I/O-efficient algorithms for these problems as
well.
Fig. 5.6. O(sort(N )) I/O reductions between the fundamental problems DFS,
BFS, SSSP, and graph partition on planar graphs. An arrow from one problem
to another indicates that the first problem can be solved in O(sort(N )) I/Os if
the problem the arrow points to can be solved in that many I/Os.
5.7.1 Planar Graphs
A number of researchers have put considerable effort into designing algo-
rithms that compute optimal partitions of planar graphs I/O-efficiently. The
first step towards a solution has been made in [419], where it is shown
that an optimal (2N/3)-partition of a planar graph G can be computed in
O(sort(N )) I/Os using an I/O-efficient version of Lipton and Tarjan’s al-
gorithm [507]. Unfortunately the I/O-bound holds only if a BFS-tree and a
planar embedding of G are given. Arge et al. [60] present an I/O-efficient
variant of an algorithm due to Goodrich [344]. Given a planar graph G and
an integer h > 0, the algorithm takes O(sort(N )) I/Os to compute a separator
of size O(sort(N ) + N/√h) that induces an h-partition of G. Again, a
BFS-tree and a planar embedding of G are required. In addition, the amount
of available memory is required to be M > B^{2+α}, for some α > 0. Together
with the shortest path algorithm from Section 5.5.1, the DFS-algorithm from
Section 5.6.1, and the observation that BFS can be solved using an SSSP-
algorithm, the latter result leads to circular dependencies between different
fundamental problems on embedded planar graphs as shown in Fig. 5.6. In
particular, if any of the three problems in the cycle—i.e., computing optimal
h-partitions, BFS, or SSSP—can be solved in O(sort(N )) I/Os, all problems
in Fig. 5.6 can be solved in this number of I/Os on embedded planar graphs.
The breakthrough has been achieved in [523], where it is shown that an
optimal h-partition of a planar graph G can be computed in O(sort(N )) I/Os
without using a solution to any of the other problems, provided that the
amount of available main memory is Ω(h log² B). Since the shortest path
algorithm from Section 5.5.1 requires an optimal B²-partition, this implies
that BFS, DFS, and SSSP can be solved in O(sort(N )) I/Os, provided that
M = Ω(B² log² B). In [775], it is shown that using these results and an
I/O-efficient version of a recent result of [26], optimal partitions of planar
graphs with costs and weights on their vertices and optimal edge separators
of weighted planar graphs can be obtained in O(sort(N )) I/Os. We sketch
here the result of [523] on computing unweighted partitions because it is the
key to the efficiency of the other algorithms, and refer the reader to [775] for
the details of the more general separator algorithm.
The algorithm of [523] obtains an optimal h-partition of G by careful
application of the graph contraction technique, combined with a linear-time
internal memory algorithm for this problem. In particular, it first constructs
a hierarchy of planar graphs G = H0 , . . . , Hr whose sizes are geometrically
decreasing and so that |Hr | = O(N/B). The latter implies that applying the
internal memory algorithm to Hr in order to compute an optimal partition
of Hr takes O(N/B) = O(scan(N )) I/Os. Given the partition of Hr , the
algorithm now iterates over graphs Hr−1 , . . . , H0 , in each iteration deriving a
partition of Hi from the partition of Hi+1 computed in the previous iteration.
The construction of a separator Si for Hi starts with the set S′i of vertices
in Hi that were contracted into the vertices in Si+1 during the construction
of Hi+1 from Hi . Set S′i induces a preliminary partition of Hi , which is then
refined by adding new separator vertices to S′i. The resulting set is Si .
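The control flow of this contract-partition-refine pipeline can be summarized as follows. This is only a sketch with hypothetical callbacks: contract produces the next, roughly half-sized graph of the hierarchy, internal_partition is the linear-time in-memory algorithm, lift recovers the preliminary separator S′i, and refine partitions the components of Hi − S′i as described in the surrounding text.

    import math

    def optimal_partition(G, h, B, contract, internal_partition, lift, refine):
        r = max(1, int(math.log2(B)))            # r = log B hierarchy levels
        H = [G]
        for i in range(r):
            H.append(contract(H[i]))             # |H_{i+1}| is about |H_i|/2
        coarse = h * int(math.log2(B)) ** 2      # subgraph size h log^2 B
        S = internal_partition(H[r], coarse)     # initial partition of H_r
        for i in range(r - 1, -1, -1):
            S_pre = lift(S, H[i])                # preliminary separator S'_i
            target = coarse if i > 0 else h      # the final step aims for h
            S = refine(H[i], S_pre, target)
        return S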
The efficiency of the procedure and the quality of its output depend heav-
ily on the properties of the computed graph hierarchy. In [523] it is shown
that a graph hierarchy G = H0 , . . . , Hr with the following properties can be
constructed in O(sort(N )) I/Os:
(i) For all 0 ≤ i ≤ r, graph Hi is planar,
(ii) For all 1 ≤ i ≤ r, every vertex in Hi represents a constant number of
vertices in Hi−1 and at most 2ⁱ vertices in G, and
(iii) For all 0 ≤ i ≤ r, |Hi | = O(N/2ⁱ).
Choosing r = log B, Property (iii) implies that |Hr | = O(N/B), as re-
quired by the algorithm. Properties (ii) and (iii) can be combined with an
appropriate choice of the size of the subgraphs in the partitions of graphs
Hr , . . . , H1 to guarantee that the final partition of G is optimal. In partic-
ular, the algorithm makes sure that for 1 ≤ i ≤ r, separator Si induces an
(h log² B)-partition of Hi , and only the refinement step computing S = S0
from S′0 has the goal of producing an h-partition of G = H0 . Aleksandrov
and Djidjev [25] show that for any graph of size N and any h′ > 0, their al-
gorithm computes a separator of size O(N/√h′) that induces an h′-partition
of the graph. Hence, since we use the algorithm of [25] to compute Sr and to
derive separator Si from S′i, for 0 ≤ i < r, we have |Sr | = O(|Hr |/(√h log B)),
and for i > 0, the construction of Si from S′i adds O(|Hi |/(√h log B)) separator
vertices to S′i. By Properties (ii) and (iii), this implies that |S′0 | = O(N/√h).
In order to obtain an h-partition of G, the algorithm of [25] adds another
O(N/√h) separator vertices to S′0 , so that S induces an optimal h-partition
of G.
The efficiency of the algorithm also follows from Properties (ii) and (iii).
We have already argued that for r = log B, |Hr | = O(N/B), so that the
linear-time separator algorithm takes O(scan(N )) I/Os to compute the initial
(h log² B)-partition of Hr . Property (ii) implies that separator S′i induces a
(ch log² B)-partition of Hi , for some constant c ≥ 1. Under the assumption
that M ≥ ch log² B, this implies that every connected component of Hi − S′i
fits into main memory. Hence, the algorithm of [523] computes the connected
components of Hi − S′i , loads each of them into main memory and applies
the internal memory algorithm of [25] to partition it into subgraphs of size
at most h log² B (or h, if i = 0).
Since Sr can be computed in O(scan(N )) I/Os and the only external
memory computation required to derive Si from S′i is computing the con-
nected components of Hi − S′i , the whole algorithm takes O(scan(N )) +
Σ_{i=0}^{r−1} O(sort(|Hi |)) = Σ_{i=0}^{r−1} O(sort(N/2ⁱ)) = O(sort(N )) I/Os.
In order to use the computed partition in the SSSP-algorithm from Sec-
tion 5.5.1, it has to satisfy a few more stringent properties than optimality in
the above sense. In particular, it has to be guaranteed that each of the O(N/h)
subgraphs in the partition is adjacent to at most √h separator vertices and
that there are only O(N/h) boundary sets as defined in Section 5.5.1. In [775],
it is shown that these properties can be ensured using a post-processing that
takes O(sort(N )) I/Os and increases the size of the computed separator by
at most a constant factor. The construction is based on ideas from [315].
5.7.2 Grid Graphs
For a grid graph G, the geometric information associated with its vertices
makes it very easy to compute an h-partition of G. In particular, every vertex
stores its coordinates (i, j) in the grid. Then
the separator S is chosen to contain all vertices in rows and columns
√h, 2√h, 3√h, . . .. Separator S has size O(N/√h) and partitions G into
subgraphs of size at most h. That is, the computed partition is optimal.
Since every vertex in a grid graph can be connected only to its eight
neighboring grid vertices, each subgraph in the computed partition is
adjacent to at most 4√h separator vertices. The
number of boundary sets in the partition is O(N/h). Hence, this partition
can be used in the shortest path algorithm from Section 5.5.1.
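In code, the whole partition step is a single scan over the vertex set; the sketch below assumes vertices is a set of (i, j) grid coordinates.

    import math

    def grid_separator(vertices, h):
        # Select every vertex whose row or column index is a multiple of
        # sqrt(h); the remaining subgrids have fewer than h vertices each.
        s = max(2, math.isqrt(h))
        return {(i, j) for (i, j) in vertices if i % s == 0 or j % s == 0}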
5.7.3 Graphs of Bounded Treewidth and Outerplanar Graphs
As in the case of the single-source shortest path problem, the computation
of h-partitions for graphs of bounded treewidth and outerplanar graphs is
similar. Again, for outerplanar graphs, a simplified algorithm producing a
slightly better partition is presented in [775]; but the basic approach is the
same as in the general algorithm for graphs of bounded treewidth, which we
describe next.
As the shortest path algorithm, the algorithm starts by computing a nice
tree-decomposition D = (T, X ) of G using either the tree-decomposition al-
gorithm from Section 5.8.2 or the outerplanar embedding algorithm from
Section 5.8.3. The goal of the algorithm is to use the tree-decomposition to
compute an optimal h-partition of G, i.e., a partition of G into subgraphs of
size at most h using a separator of size O(kN/h), where k is the treewidth
of G. To do this, tree T is processed from the leaves towards the root, starting
with an empty separator S. For every node i ∈ T , a weight ω(i) is computed,
which equals the size of the connected component of G(i) − S containing
the vertices in Xi . At a leaf, ω(i) is computed as follows: If |G(i)| > h/2,
all vertices in Xi are added to S, and ω(i) = 0. Otherwise ω(i) = |G(i)|.
At a forget node with child j, ω(i) = ω(j). At an introduce node with
child j, let ω′(i) = ω(j) + 1. If ω′(i) > h/2, the vertices in Xi are added
to S, and ω(i) = 0. Otherwise ω(i) = ω′(i). Finally, at a join node with
children j and k, let ω′(i) = ω(j) + ω(k). Then ω(i) is
computed using the same rules as for an introduce node.
that the computed vertex set S induces an h-partition of G. To see that S
has size O(kN/h), observe that every group of k + 1 vertices added to S
can be charged to a set of at least h/2 vertices in G and that these groups
of charged vertices are disjoint. Hence, at most (k + 1)N/(h/2) = O(kN/h)
vertices are added to S.
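The weight computation translates directly into a post-order traversal of T. The sketch below uses in-memory recursion, whereas the external algorithm evaluates the same rules with time-forward processing; subgraph_size[v] = |G(v)| is assumed available at the leaves.

    def treewidth_separator(root, kind, X, children, subgraph_size, h):
        S = set()

        def weight(v):
            if kind[v] == 'leaf':
                if subgraph_size[v] > h / 2:
                    S.update(X[v])
                    return 0
                return subgraph_size[v]
            if kind[v] == 'forget':
                return weight(children[v][0])
            if kind[v] == 'introduce':
                w = weight(children[v][0]) + 1
            else:                          # join node
                w = weight(children[v][0]) + weight(children[v][1])
            if w > h / 2:                  # component grew too large:
                S.update(X[v])             # cut it off at the bag X[v]
                return 0
            return w

        weight(root)
        return S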
5.8 Gathering Structural Information
Having small separators is a structural property that all the graph classes we
consider have in common. In this section we review I/O-efficient algorithms
to gather more specific information about each class. In particular, we sketch
algorithms for computing outerplanar and planar embeddings of outerplanar
and planar graphs and tree-decompositions of graphs of bounded treewidth.
These algorithms are essential, at least from a theoretical point of view, as
all of the algorithms presented in previous sections, except the separator
algorithm for planar graphs, require an embedding or tree-decomposition to
be given as part of the input.
5.8.1 Planarity Testing and Planar Embedding
In order to test whether a given graph is planar, the algorithm of [523] exploits
the fact that the separator algorithm from Section 5.7.1 does not require a
planar embedding to be given as part of the input. In fact, the algorithm
can be applied even without knowing whether G is planar. The strategy
of the planar embedding algorithm is to use the separator algorithm and
try to compute an optimal B²-partition of G whose subgraphs G1 , . . . , Gq
have boundary size at most B. If the separator algorithm fails to produce
the desired partition in O(sort(N )) I/Os, the planar embedding algorithm
terminates and reports that G is not planar. Otherwise the algorithm first
tests whether each of the graphs G1 , . . . , Gq is planar. If one of these graphs
is non-planar, graph G cannot be planar. If graphs G1 , . . . , Gq are planar,
each graph Gi is replaced with a constraint graph Ci of size O(B). These
constraint graphs have the property that graph G is planar if and only if
the approximate graph A obtained by replacing each subgraph Gi with its
constraint graph Ci is planar. If A is planar, a planar embedding of G is
obtained from a planar embedding of A by locally replacing the embedding
of each constraint graph Ci with a consistent planar embedding of Gi .
This approach leads to an I/O-efficient algorithm using the following
observations: (1) Graphs G1 , . . . , Gq have size at most B², and graphs
C1 , . . . , Cq have size O(B) each. Thus, the test of each graph Gi for planarity
and the construction of the constraint graph Ci from Gi can be carried out in
main memory, provided that M ≥ B², which has to be true already in order
to apply the separator algorithm. (2) Graph A has size O(N/B) because it
is constructed from O(N/B²) constraint graphs of size O(B), so that a lin-
ear time planarity testing and planar embedding algorithm (e.g., [144]) takes
O(scan(N )) I/Os to test whether A is planar and if so, produce a planar
embedding of A. The construction of consistent planar embeddings of graphs
G1 , . . . , Gq from the embeddings of graphs C1 , . . . , Cq can again be carried
out in main memory.
This seemingly simple approach involves a few technicalities that are dis-
cussed in detail in [775]. At the core of the algorithm is the construction of
the constraint graph Ci of a graph Gi . This construction is based on a careful
analysis of the structure of graph Gi and beyond the scope of this survey. We
refer the reader to [775] for details. However, we sketch the main ideas.
The construction is based on the fact that triconnected planar graphs
are rigid in the sense that they have only two different planar embeddings,
which can be obtained from each other by “flipping” the whole graph. The
construction of constraint graph Ci partitions graph Gi into its connected
components, each connected component into its biconnected components,
and each biconnected component into its triconnected components. The con-
nected components can be handled separately, as they do not interact with
each other. The constraint graph of a connected component is constructed
bottom-up from constraint graphs of its biconnected components, which in
turn are constructed from constraint graphs of their triconnected compo-
nents.
The constraint graph of a triconnected component is triconnected, and
its embedding contains all faces of the triconnected component so that other
parts of G may be embedded in these faces. The rest of the triconnected
component is compressed as far as possible while preserving triconnectivity
and planarity.
The constraint graph of a biconnected component is constructed from the
constraint graphs of its triconnected components by analyzing the amount of
interaction of these triconnected components with the rest of G. Depending
on these interactions, the constraint graph of each triconnected component
is either (1) preserved in the constraint graph of the biconnected component,
(2) grouped with the constraint graphs of a number of other triconnected
components, or (3) does not appear in the constraint graph of the biconnected
component at all because it has no influence on the embedding of any part
of G that is not in this biconnected component. In the second case, the group
of constraint graphs is replaced with a new constraint graph of constant size.
The constraint graph of a connected component is constructed in a similar
manner from the constraint graphs of its biconnected components.
5.8.2 Computing a Tree-Decomposition
The algorithm of [522] for computing a tree-decomposition of width k for a
graph G of treewidth k follows the framework of the linear-time algorithm
for this problem by Bodlaender and Kloks [135, 136]. The algorithm can also
be used to test whether a given graph G has treewidth at most k, as long
as k is constant. The details of the algorithm are rather involved, so that we
sketch only the main ideas here. In [135] it is shown that for every k > 0,
there exist two constants c1 , c2 > 0 so that for every graph G of treewidth k
one of the following is true: (1) Every maximal matching of G contains at least
c1 N edges. (2) G contains a set X of at least c2 N vertices so that a tree-
decomposition of width k for G can be obtained by attaching one node of
size at most k + 1 per vertex in X to a tree-decomposition of width k for the
graph G − X. In the latter case, the algorithm of [522] recursively computes
a tree-decomposition of G − X and then attaches the additional nodes in
O(sort(N )) I/Os. In the former case, it computes a tree-decomposition D′ of
width at most k for the graph G′ obtained by contracting the edges in a max-
imal matching. The maximal matching can be computed in O(sort(N )) I/Os
using an algorithm of [775]. By replacing every vertex in G′ that corresponds
to an edge in the matching by the two endpoints of this edge, one immediately
obtains a tree-decomposition of width at most 2k + 1 for G. In order to ob-
tain a tree-decomposition of width at most k, an I/O-efficient version of the
algorithm of [136] is used, which computes the desired tree-decomposition
in O(sort(N )) I/Os. This algorithm starts by transforming the given tree-
decomposition D of width at most 2k + 1 into a nice tree-decomposition
of width at most 2k + 1 and size O(N ). Then dynamic programming is ap-
plied to this tree-decomposition, from the leaves towards the root, in order
to compute an implicit representation of a tree-decomposition of width k for
graph G. In a second pass of time-forward processing from the leaves towards
the root, the tree-decomposition is extracted from this implicit representa-
tion. The details of this algorithm are complex and beyond the scope of this
survey.
Given that each recursive step of the algorithm takes O(sort(N )) I/Os,
the I/O-complexity of the whole algorithm is O(sort(N )), since graphs G′
and G − X passed to recursive invocations of the algorithm contain only a
constant fraction of the vertices and edges of G.
5.8.3 Outerplanarity Testing and Outerplanar Embedding
In order to compute an outerplanar embedding of an outerplanar graph, the
algorithm of [775]³ exploits the following two observations: (i) An outerpla-
nar embedding of an outerplanar graph can be obtained from outerplanar
embeddings of the biconnected components of the graph, by making sure
that no biconnected component is embedded in an interior face of another
biconnected component. (ii) The boundary of the outer face of an outerpla-
nar embedding of a biconnected outerplanar graph G is the only cycle in G
that contains all vertices of G.
Observation (i) can be used to reduce the problem of computing an
outerplanar embedding of G to that of computing outerplanar embeddings
of its biconnected components. These components can be computed in
O(sort(N )) I/Os using the biconnectivity algorithm from Section 5.4.3, so
that any outerplanar graph can be embedded in O(sort(N )) I/Os if a bi-
connected outerplanar graph can be embedded in this number of I/Os. To
achieve the latter, Observation (ii) is exploited.
The algorithm for embedding a biconnected outerplanar graph computes
the boundary cycle C of G, numbers the vertices along C, and uses this
numbering of the vertices to derive the final embedding of G. Assume for
now that cycle C is given. Then the desired numbering of the vertices of G
can be obtained by removing an arbitrary edge from cycle C and applying
the Euler tour technique and list ranking to the resulting path in order to
compute the distances of all vertices in this path from one of the endpoints
of the removed edge. Given the numbering of the vertices along C, the edges
incident to every vertex in G can be ordered clockwise around this vertex
using the observation that these edges appear in the same order clockwise
around v as their endpoints clockwise along C. Hence, an outerplanar embed-
ding of G can be computed in O(sort(N )) I/Os if cycle C can be identified
in O(sort(N )) I/Os.
To compute cycle C, the algorithm exploits the fact that this cycle is
unique. In particular, an algorithm that computes any cycle containing all
vertices in G must produce cycle C. A cycle containing all vertices of G can
be computed as follows from an open ear decomposition of G. Let P0 , . . . , Pq
be the ears in the ear decomposition. Then remove every ear Pi , i > 0, that
consists of a single edge. The resulting graph is a biconnected subgraph of G
that contains all vertices of G. For every remaining ear Pi , i > 0, remove
edge (a, b) from G, where a and b are the endpoints of Pi . This procedure
can easily be carried out using a constant number of sorts and scans, so that
it takes O(sort(N )) I/Os. The following argument shows that the resulting
graph is the desired cycle C.
Let P0 , . . . , Pr be the set of ears remaining after removing all ears Pi ,
i > 0, consisting of a single edge. Then the above construction is equivalent
³ The algorithm for this problem originally presented in [521] is more complicated,
so that we present the simplified version from [775] here.
to the following construction of graphs G1 , . . . , Gr . Graph G1 is the union of
ears P0 and P1 . Since ear P0 consists of a single edge, and the endpoints of
ear P1 are in P0 , graph G1 is a cycle. In order to construct graph Gi from
cycle Gi−1 , remove the edge connecting the endpoints of ear Pi from Gi−1
and add Pi to the resulting graph. If the endpoints of ear Pi are adjacent in
Gi−1 , graph Gi is again a cycle. But if the endpoints of Pi are not adjacent
in Gi−1 , it follows from the biconnectivity of Gi−1 and the fact that Pi con-
tains at least one internal vertex that graph Gi contains a subgraph that
is homeomorphic to K2,3 , so that Gi and hence G cannot be outerplanar.
Applying this argument inductively, we obtain that Gr = C.
The algorithm sketched above can easily be augmented to test whether a
given graph is outerplanar. For details, we refer the reader to [775].
5.9 Conclusions and Open Problems
The algorithms for BFS, DFS, and SSSP on special classes of sparse graphs
are a major step towards solving these problems on sparse graphs in general.
In particular, the results on planar graphs have answered the long stand-
ing question whether these graphs allow O(sort(N )) I/O solutions for these
problems. However, all these algorithms are complex because they are based
on computing separators. Thus, the presented results pose a new challenge,
namely that of finding simpler, practical algorithms for these problems.
Since the currently best known separator algorithm for planar graphs
requires that M = Ω(B² log² B), the algorithms for BFS, DFS, and SSSP on
planar graphs inherit this constraint. It seems that this memory requirement
of the separator algorithm (and hence of the other algorithms as well) can be
removed or at least reduced if the semi-external single source shortest path
problem can be solved in O(sort(|E|)) I/Os on arbitrary graphs. (“Semi-
external” means that the vertices of the graph fit into main memory, but the
edges do not.)
For graphs of bounded treewidth, the main open problem is finding an
I/O-efficient DFS-algorithm. Practicality is not an issue here, as the chances
of obtaining practical algorithms for these graphs are minimal as soon as the
algorithms rely on a tree-decomposition.
For grid graphs, the presented shortest path algorithm uses a partition of
the graph into a number of cells that depends on the size of the grid. This
may be non-optimal if the graph is an extremely sparse subgraph of the grid.
An interesting question here is whether it is possible to exploit the geometric
information provided by the grid to obtain a partition of the same quality
as the one obtained by the separator algorithm for planar graphs, but with
much less effort, i.e., in a way that leads to a practical algorithm.
6. External Memory Computational Geometry
Revisited
Christian Breimann and Jan Vahrenhold∗
6.1 Introduction
Computational Geometry is an area in Computer Science basically concerned
with the design and analysis of algorithms and data structures for problems
involving geometric objects. This field started in the 1970’s and has evolved
into a discipline reaching out to areas such as Complexity Theory, Discrete
and Combinatorial Geometry, or Algorithm Engineering. Geometric problems
occur in a variety of applications, e.g., Computer Graphics, Databases, Geo-
sciences, or Medical Imaging, and there are several textbooks presenting (in-
ternal memory) geometric algorithms [239, 275, 342, 541, 566, 596, 614, 647].
The systematic investigation of geometric algorithms specifically designed for
massive data sets started in the early 1990’s, most noticeably after Goodrich
et al. presented their pioneering paper “External Memory Computational
Geometry” [345].
In our survey, we intend to give an overview of results that have been
obtained during the last decade and try to relate these results to internal
memory algorithms as well. We will review algorithms and data structures
for geometric problems involving massive data sets. Our focus will be both on
theoretical results and on practical applications. Due to this double focus, this
chapter contains not only an overview of fundamental geometric problems and
corresponding specialized algorithms and data structures developed in the
areas of Computational Geometry and Spatial Databases, but also a discus-
sion of how general-purpose index structures already implemented in com-
mercial database systems can be used for solving geometric problems.
As a prominent application area involving massive data sets, spatial
database systems have attracted increasing interest both in research com-
munities and among professional users. In addition to the growing number of
applications, the increasing availability of spatial data in form of digital maps
and images is one of the main reasons for this trend, and tightly coupled to
this, applications demand sophisticated computations and complex analyses
of such data. In this area, it is not uncommon to use sub-optimal algorithms
(in terms of their asymptotic complexity) if they lead to better performance
in practice.
∗ Part of this work was done while on leave at the University for Health Informatics
and Technology Tyrol, 6020 Innsbruck, Austria.
The geometric problems described in the remainder of this survey are
grouped according to the kind of objects for which the problem is defined.
In Section 6.3, we describe problems involving sets of points (Problems 6.1–
6.11), in Section 6.4, we present a discussion of problems involving sets of
segments (Problems 6.12–6.20), and in Section 6.5, we conclude by survey-
ing problems involving sets of polygons (Problem 6.21 and Problem 6.22).
Table 6.1 contains an overview of all problems covered in this chapter.
Table 6.1. Geometric problems surveyed in this chapter.
No. Problem Name No. Problem Name
1 Convex Hull 12 Segment Stabbing
2 Halfspace Intersection 13 Segment Sorting
3 Closest Pair 14 Endpoint Dominance
4 K-Bichromatic Closest Pairs 15 Trapezoidal Decomposition
5 Nearest Neighbor 16 Polygon Triangulation
6 All Nearest Neighbors 17 Vertical Ray-Shooting
7 Reverse Nearest Neighbors 18 Planar Point Location
8 K-Nearest Neighbors 19 Bichromatic Segment Intersection
9 Halfspace Range Searching 20 Segment Intersection
10 Orthogonal Range Searching 21 Rectangle Intersection
11 Voronoi Diagram 22 Polygon Intersection
Whenever appropriate, we will sub-classify problems according to the ex-
tent to which the sets may be updated:
Static Setting: All data items are fixed prior to running an algorithm or
building a data structure, and no changes to the data items may occur
afterwards.
Dynamic Setting: The set of items that forms the problem instance can be
updated by insertions as well as deletions.
Semidynamic Setting: The set of items that forms the problem instance can
be updated by either insertions or deletions, but not both.
The dynamic and semidynamic setting can also be considered in a batched
variant, that is, all updates have to be known in advance. For problems that
involve answering queries, we additionally distinguish between two kinds of
queries:
Single-Shot Queries: Each query has to be answered independent of other
queries and before the next query may be posed.
Batched Queries: The user specifies a collection of queries, and the only re-
quirement is that all queries are answered by the end of the algorithm.
Before surveying geometric problems and corresponding solutions, we will
briefly review the model of computation and introduce three general tech-
niques for solving large-scale geometric problems.
6.2 General Methods for Solving Geometric Problems
External memory algorithms are investigated analytically in the parallel disk
model introduced by Aggarwal and Vitter [17] and later refined by Vitter and
Shriver [755]. The parallel disk model, which is based on blocked transfers,
uses the following parameters:
N = Number of objects in the problem instance
M = Number of objects that fit simultaneously into main memory
B = Number of objects that fit into one disk block
D = Number of independent disks
P = Number of parallel processors
In this survey, however, we will restrict ourselves to algorithms for single-
disk/single-processor settings, that is, we assume D = 1 and P = 1. For
algorithms involving multiple queries, we consider two additional parameters:
Q = Number of queries
Z = Number of objects in the answer set
This model allows computations only on elements that are in main mem-
ory, and whenever additional elements are needed in main memory, they have
to be read from disk. The measures of performance for an external memory
algorithm are the number of I/Os performed during its execution and the
amount of disk space occupied (in terms of disk blocks).
In the remainder of this section, we present some general methods which
are often used to solve geometric problems. First of all, we briefly discuss
the implications of solving a problem by reducing it to a problem for which
an efficient algorithm is known and present the concept of duality which
sometimes can be used for this purpose (Section 6.2.1). In Section 6.2.2,
we describe the general distribution sweeping paradigm, an external version
of the well-known plane sweeping. Section 6.2.3 covers the R-tree, a spatial
index structure frequently used in spatial database systems, and some of its
variants.
6.2.1 Reduction of Problems
A common technique for proving lower bounds for (geometric) problems is to
reduce the problem to some fundamental problem for which a lower bound is
known. Among these fundamental problems is the Element Uniqueness prob-
lem which is, given a collection of N objects, to determine whether any two
are identical. The lower bound for this problem is Ω((N/B) logM/B (N/B))
[59], and—looking at the reduction from the opposite direction—a matching
upper bound for the Element Uniqueness problem for points can be obtained
by solving what is called the Closest Pair problem (see Problem 6.3). For a
given collection of points, this problem consists of computing a pair with min-
imal distance. This distance is non-zero if and only if the collection does not
contain duplicates, that is if and only if the answer to the Element Uniqueness
problem is negative.
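In code, this reduction is one line on top of any Closest Pair solver; the quadratic stand-in below is for illustration only and assumes at least two points.

    import math

    def closest_pair_distance(points):
        # Naive stand-in for an I/O-efficient Closest Pair algorithm.
        return min(math.dist(p, q)
                   for i, p in enumerate(points) for q in points[i + 1:])

    def elements_unique(points):
        # Pairwise distinct iff the minimal pairwise distance is non-zero.
        return closest_pair_distance(points) > 0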
Fig. 6.1. Reduction of problems. (a) Concept of transformation. (b) Transformation of bounds.
A more general view is given by Figure 6.1. It shows that reducing (trans-
forming) a problem A to a problem B means transforming the input of A first,
then solving problem B, and transforming its solution back afterwards (see
Figure 6.1 (a)). Such a transformation is said to be a τ (N )-transformation
if and only if transforming both the input and the solution can be done in
O(τ (N )) time. If an algorithm for solving the problem B has an asymptotic
complexity of O(fB (N )), the problem A can be solved in O(fB (N ) + τ (N ))
time. In addition, if the intrinsic complexity of the problem A is Ω(fA (N ))
and if τ (N ) ∈ o(fA (N )), then B also has a lower bound of Ω(fA (N )) (see Fig-
ure 6.1 (b) and, e.g., the textbook by Preparata and Shamos [614, Chap. 1.4]).
Reduction via Duality In Section 6.3, which is entitled “Problems Involv-
ing Sets of Points”, we will discuss the following problem (Problem 6.2):
“Given a set S of N halfspaces in IRd , compute the common inter-
section of these halfspaces.”
At first, it seems surprising that this problem should be discussed in
a section devoted to problems involving set of points. Using the concept
of geometric duality, however, points and halfspaces can be identified in a
consistent way: A duality transform maps points in IRd into the set G d of
non-vertical hyperplanes in IRd and vice versa. The classical duality transform
between points and hyperplanes is defined as follows:
D : G d → IRd : xd = ad + Σ_{i=1}^{d−1} ai xi ↦ (a1 , . . . , ad )
    IRd → G d : (b1 , . . . , bd ) ↦ xd = bd − Σ_{i=1}^{d−1} bi xi
Another well-known transform D that is used, e.g., in the context of the
Convex Hull problem (Problem 6.1), maps a point p on the unit parabola to
the unique hyperplane that is tangent to the parabola in p. For the sake of
simplicity, this duality transform is stated for d = 2.
D : G 2 → IR2 : y = 2ax − b ↦ (a, b)
    IR2 → G 2 : (a, b) ↦ y = 2ax − b
An important property of these transforms is that they are their own in-
verses and that they preserve the “above-below” relation: a point p lies above
(below) a hyperplane ℓ with respect to the d-th dimension if and only if the
hyperplane D(p) lies above (below) the point D(ℓ) with respect to the d-th
dimension.
This property is exploited in several algorithms for, e.g., the Range Search-
ing problem, the Convex Hull problem, the K-Nearest Neighbors problem,
or the Voronoi Diagram problem. These algorithms first reduce the original
problem to a problem stated for the duals of the original objects, solve the
problem in the dual setting, and finally employ the same duality transform
to obtain the solution to the original problem. We refer the interested reader
to textbooks on Computational Geometry (e.g. [275, 541, 596]) for a more
detailed treatment of these duality transforms.
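For d = 2 and the parabola-based transform, the self-inverse and order-preserving properties are easy to check empirically. Sign conventions vary between textbooks; with the convention used here, the roles of point and line swap under D, as the sketch below verifies.

    import random

    def above(point, line):
        # Is `point` above `line`?  A pair (a, b), read as a line, denotes
        # y = 2ax - b, the tangent to the unit parabola at x = a if b = a*a.
        (px, py), (a, b) = point, line
        return py > 2 * a * px - b

    # D merely reinterprets a pair (a, b) as a point or as a line, so
    # D(D(x)) = x.  Moreover, p lies above l iff D(l) lies above D(p):
    # both conditions unfold to py + b > 2 * a * px.
    for _ in range(1000):
        p = (random.uniform(-9, 9), random.uniform(-9, 9))
        l = (random.uniform(-9, 9), random.uniform(-9, 9))
        assert above(p, l) == above(l, p)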
6.2.2 Distribution Sweeping
A large number of internal memory algorithms is based upon plane sweeping,
a general technique for turning a static (d + 1)-dimensional problem into a
(finite) collection of instances of a dynamic d-dimensional problem. Although
the general approach is independent of the dimension d, it is most efficient
when the (original) problem is two-dimensional, and therefore we will restrict
the following description to this setting.
The characterizing feature of the plane sweeping technique is an (imag-
inary) line (or, in the general setting, a hyperplane) that is swept over the
entire data set. For sake of simplicity, this sweep-line is usually assumed to
be perpendicular to the x-axis of the coordinate system and to move from
left to right. Any object intersected by the sweep-line at x = t is called active
at time t, and only active objects are involved in geometric computations at
that time. In the situation depicted in Figure 6.2(a), the sweep-line is drawn
in bold, and the active objects, i.e., the objects intersected by the sweep-line,
are the line segments A and B.
To guarantee the correctness of a plane-sweep algorithm, one has to take
care of restating the original problem in such a way that operations involving
only active objects are sufficient to determine the proper solution to the
problem; Graf [350], e.g., summarized several formulations of plane-sweep
algorithms.
All objects active at a given time are usually stored in a dictionary called
sweep-line structure. The status of the sweep-line structure is updated as soon
Fig. 6.2. The plane sweeping and distribution sweeping techniques. (a) Internal plane sweeping. (b) External distribution sweeping.
as the sweep-line moves to a point where the topology of the active objects
changes discontinuously: for example, an object must be inserted into the
sweep-line structure as soon as the sweep-line hits its leftmost point, and it
must be removed after the sweep-line has passed its rightmost point. The
sweep-line structure can be maintained in logarithmic time per update if the
objects can be ordered linearly, e.g., by the y-value of their intersection with
the sweep line.
For a finite set of objects, there are only finitely many points where the
topology of the active objects changes discontinuously, e.g., when objects are
inserted into or deleted from the sweep-line structure; these points are called
events and are stored in increasing order of their x-coordinates, e.g., in a
priority queue. Depending on the problem to be solved, there may exist ad-
ditional event types apart from insert and delete events. The data structure
for storing the events is called event queue, and maintaining it as a prior-
ity queue under insertions and deletions can be accomplished in logarithmic
time per update. That is, if the active objects can be ordered linearly, each
event can be processed in logarithmic time (excluding the time needed for
operations involving active objects). As a consequence, the plane sweeping
technique often leads to optimal algorithms, e.g., the Closest Pair problem
can be solved in optimal time O(N log₂ N ) [398].
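The two ingredients, event queue and sweep-line structure, are visible even in a toy instance of the technique. The sketch below sweeps over x-intervals and reports the maximum number of simultaneously active intervals; here the set of active objects degenerates to a counter.

    import heapq

    def max_active(intervals):
        events = []                           # event queue, ordered by x
        for (x1, x2) in intervals:
            heapq.heappush(events, (x1, +1))  # insert event: becomes active
            heapq.heappush(events, (x2, -1))  # delete event: stops being active
        active = best = 0
        while events:
            _, delta = heapq.heappop(events)  # at equal x-coordinates,
            active += delta                   # deletions precede insertions
            best = max(best, active)
        return best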
The straightforward approach for externalizing the plane sweeping tech-
nique would be to replace the (internal) sweep-line structure by a corre-
sponding external data structure, e.g., a B-tree [96]. A plane-sweep algo-
rithm with an internal memory time complexity of O(N log2 N ) then spends
O(N logB N ) I/Os. For problems with an external memory lower bound of
Ω((N/B) logM/B (N/B)), however, the latter bound is at least a factor of
B away from optimal.¹ The key to an efficient external sweeping technique
is to combine sweeping with ideas similar to divide-and-conquer, that is, to
subdivide the plane prior to sweeping. To aid imagination, consider the plane
subdivided into Θ(M/B) parallel (vertical) strips, each containing the same
¹ Often the (realistic) assumption M/B > B is made. In such a situation, an
additional (non-trivial) factor of logB M > 2 is lost.
number of data objects (see Figure 6.2(b)).² Each of these strips is then
processed using a sweep over the data and eventually by recursion. How-
ever, in contrast to the description of the internal case, the sweep-line is
perpendicular to the y-axis, and sweeping is done from top to bottom. The
motivation behind this modified description is to facilitate the intuition be-
hind the novel ingredient of distribution sweeping, namely the subdivision
into vertical strips.
The subdivision proceeds using a technique originally proposed for distri-
bution sort [17], hence, the resulting external plane sweeping technique has
been christened distribution sweeping [345]. While in the situation of distri-
bution sort all partitioning elements have to be selected using an external
variant of the median find algorithm [133, 307], distribution sweeping can
resort to having an optimal external sorting algorithm at hand. The set of
all x-coordinates is sorted in ascending order, and for each (recursive) sub-
division of a strip, the Θ(M/B) partitioning elements can be selected from
the sorted sequence spending an overall number of O(N/B) I/Os per level of
recursion.
Using this linear partitioning algorithm as a subroutine, the distribution
sweeping technique can be stated as follows: Prior to entering the recursive
procedure, all objects are sorted with respect to the sweeping direction, and
the set of x-coordinates is sorted such that the partitioning elements can be
found efficiently. During each recursive call, the current data set is parti-
tioned into M/B strips. Objects that interact with objects from other strips
are found and processed during a sweep over the strips, while interactions
between objects assigned to the same strip are found recursively. The recur-
sion terminates when the number of objects assigned to a strip falls below M
and the subproblem can be solved in main memory. If the sweep for finding
inter-strip interactions can be performed using only a linear number of I/Os,
i.e., Θ(N/B) I/Os, the overall I/O complexity for distribution sweeping is
O((N/B) logM/B (N/B)).
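Schematically, distribution sweeping has the following recursive structure. This is only a sketch: solve_in_memory and sweep_strips are problem-specific, hypothetical callbacks, objects are pairs (x, data) with distinct x-coordinates already sorted in sweeping order, and sweep_strips is responsible for all inter-strip interactions.

    import bisect

    def distribution_sweep(objects, M, B, solve_in_memory, sweep_strips):
        if len(objects) <= M:                  # subproblem fits in memory
            solve_in_memory(objects)
            return
        k = max(2, M // B)                     # Theta(M/B) strips
        xs = sorted(x for x, _ in objects)     # selected from the presorted
        bounds = [xs[i * len(xs) // k] for i in range(1, k)]   # x-sequence
        sweep_strips(objects, bounds)          # inter-strip interactions
        strips = [[] for _ in range(k)]
        for obj in objects:                    # distribute into strips
            strips[bisect.bisect_left(bounds, obj[0])].append(obj)
        for strip in strips:                   # intra-strip: recurse
            distribution_sweep(strip, M, B, solve_in_memory, sweep_strips)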
6.2.3 The R-tree Spatial Index Structure
Many algorithms proposed in the context of spatial databases assume that
the data set is indexed by a hierarchy of bounding boxes. This assumption
is justified by the popularity of the R-tree spatial index structure (and its
variants) in academic and commercial database systems.
The R-tree, originally proposed by Guttman [368], is a height-balanced
multiway tree similar to a B-tree. An R-tree stores d-dimensional data objects
approximated by their axis-parallel minimum bounding boxes. For ease of
2
Non-point objects cannot always be assigned to a unique strip because they
may interact with several strips. In such a situation, objects are assigned to a
maximal contiguous interval of strips they interact with. See the discussion of,
e.g., Problem 6.20 for more details on how to deal with such a situation.
presentation, we restrict the following discussion to the situation d = 2 and
assume that each data object itself is an axis-parallel rectangle.
The leaf nodes in an R-tree contain Θ(B) data rectangles each, where B
is the maximum fanout of the tree. Internal nodes contain Θ(B) entries of
the form (Ptr ,R), where Ptr is a pointer to a child node and R the minimum
bounding rectangle covering all rectangles which are stored in the subtree
rooted in that child. Each entry in a leaf stores a data object or, in the
general setting, the bounding rectangle of a data object and a pointer to the
data object itself. Since the bounding rectangles stored within internal nodes
are used to guide the insertion, deletion, and querying processes (see below),
they are referred to as routing rectangles, whereas the bounding rectangles
stored in the leaves are called data rectangles. An R-tree for N rectangles
consists of O(N/B) nodes and has height O(logB N ). Figure 6.3 shows an
example of an R-tree for a set of two-dimensional rectangles.
Fig. 6.3. R-tree for data rectangles A, B, C, . . . , I, K, L. The tree in this example
has maximum fanout B = 3.
To insert a new rectangle r into an already existing R-tree with root v,
we select the subtree rooted at v whose bounding rectangle needs least en-
largement to include the new rectangle. The insertion process continues re-
cursively until a leaf is reached, adjusting routing rectangles as necessary.
Since recursion takes place along a single root-to-leaf path, an insertion can
be performed touching only O(logB N ) nodes. If a leaf overflows due to an
insertion, a rebalancing process similar to B-tree rebalancing is triggered,
and therefore R-trees also grow and shrink only at the root. The insertion
path depends not only on the heuristic chosen for breaking ties in case of
non-unique subtrees for recursion, but also on the objects already present in
the R-tree. Hence, there is no unique R-tree for a given set of rectangles, and
different orders of insertion for the same set of rectangles usually result in
different R-trees.
During the insertion process, a new rectangle r might overlap the routing
rectangles of several subtrees of the node v currently visited. However, the
rectangle r is routed to exactly one such subtree. Since the routing rectangle
of this subtree might be extended to include r, the routing rectangles stored
within v can overlap. This overlap directly affects the performance of R-tree
query operations: When querying an R-tree to find all rectangles overlapping
a given query rectangle r, we have to branch at each internal node into
all subtrees whose minimum bounding rectangle overlaps r. (Queries for all
rectangles containing a given query point p can be stated in the same way
by regarding p as an infinitesimally small rectangle.) In the worst case, the
search process has to branch at each internal node into all subtrees which
results in O(N/B) nodes being touched—even though the number of reported
overlapping data rectangles might be much smaller. Intuitively, it is thus
desirable that the routing rectangles stored within a node overlap as little as
possible.
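The branching behavior of window queries is easy to state in code. The sketch below covers the query procedure only; insertion and rebalancing are omitted, and rectangles are assumed to be 4-tuples (x1, y1, x2, y2).

    class RTreeNode:
        def __init__(self, is_leaf, entries):
            self.is_leaf = is_leaf    # leaf entries: (data rectangle, object)
            self.entries = entries    # internal entries: (routing rect, child)

    def overlaps(r, s):
        # Do two axis-parallel rectangles intersect?
        return r[0] <= s[2] and s[0] <= r[2] and r[1] <= s[3] and s[1] <= r[3]

    def window_query(node, query, result):
        # May branch into every subtree whose routing rectangle overlaps
        # the query window; in the worst case this touches O(N/B) nodes.
        for rect, payload in node.entries:
            if overlaps(rect, query):
                if node.is_leaf:
                    result.append(payload)
                else:
                    window_query(payload, query, result)
        return result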
Another heuristic is to minimize the area covered by each routing rect-
angle. As a consequence, routing rectangles cover less dead space, i.e., space
covered by a routing rectangle which is not covered by any child, such that
unsuccessful searches may terminate earlier. Similar heuristics are used in
several variants of the R-tree including the R+ -tree [683], the Hilbert R-
tree [442], and the R* -tree [102], which is widely recognized to be the most
practical R-tree variant. This is especially due to the fact that the heuristics
used in the R* -tree re-insert a certain number of elements if routing rectan-
gles have to be split. This usually results in a re-structured tree with less
overlapping of routing rectangles permitting fast answers for queries. We re-
fer the interested reader also to the Generalized Search Tree [50, 389] and to
more detailed overviews [323, 754].
As mentioned above, overlapping routing rectangles decrease the query
performance of R-trees, and with increasing dimension, this overlap grows
rapidly. Therefore, other data structures, e.g., the X-tree [114], which uses
so-called supernodes permitting a sequential scan of their children, have been
developed. But as the percentage of the data space covered by routing rect-
angles grows quickly with increasing dimensionality, for d > 10, nearly ev-
ery node is accessed when querying the data structure as long as nodes are
split in a balanced way. For many data distributions, a sequential scan can
have better query performance in terms of overall running time than the
random I/Os caused by querying data structures which are based on data-
partitioning [758]. With the Pyramid-Technique [113], points and ranges in
d-dimensional data space are transformed to 1-dimensional values which can
be stored and queried using any 1-dimensional data structure, e.g., a B+-tree.
The authors claim that the Pyramid-Technique using a B+-tree outperforms
not only the data structures presented above but also the sequential scan.
We have presented some hierarchical spatial index structures which are
used to efficiently store and query multi-dimensional data objects. From now
on, whenever we refer to a hierarchical spatial index structure, any of these
structures may be used unless explicitly stated otherwise.
6.3 Problems Involving Sets of Points
The first problem we discuss in this section is not only one of the most
fundamental problems studied in Computational Geometry but also one of
the rare problems where finding an optimal external algorithm for the two-
dimensional case is completely straightforward.
(a) Planar convex hull. (b) Intersection of halfspaces in dual space.
Fig. 6.4. Computing the convex hull of a finite point set.
Problem 6.1 (Convex Hull). Given a set S of N points in IRd , find the
smallest (convex) polytope enclosing S (see Figure 6.4(a)).
Among the earliest internal memory algorithms for computing the convex
hull in two dimensions was a sort-and-scan algorithm due to Graham [352].
This algorithm, called Graham’s Scan, is based upon the invariant that when
traversing the boundary of a convex polygon in counterclockwise direction,
any three consecutive points form a left turn. The algorithm first selects
a point p that is known to be interior to the convex hull, e.g., the center of
gravity of the triangle formed by three non-collinear points in S. All points in
S are then sorted by increasing polar angle with respect to p. The convex hull
is constructed by pushing the points onto a stack in sorted order, maintaining
the above invariant. As soon as the next point to be pushed and the topmost
two points on the stack do not form a left turn, points are repeatedly removed
from the stack until only one point is left or the invariant is fulfilled. After all
points have been processed, the stack contains the points lying on the convex
hull in clockwise direction. As each point can be pushed onto (removed from)
the stack only once, Θ(N ) stack operations are performed, and the (optimal)
internal memory complexity, dominated by the sorting step, is O(N log2 N ).
This algorithm is one of the rare cases where externalization is completely
straightforward [345]. Sorting can be done using O((N/B) logM/B (N/B))
I/Os [17], and an external stack can be implemented such that Θ(N ) stack
operations require O(N/B) I/Os (see Chapter 2). The external algorithm we
obtain this way has an optimal complexity of O((N/B) logM/B (N/B)).
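For concreteness, the scan phase can be sketched as follows (illustrative Python; the input is assumed to be sorted by polar angle around the interior point, an in-memory list stands in for the external stack of Chapter 2, and wrap-around as well as collinear points are ignored):

def left_turn(a, b, c):
    # positive cross product: traversing a, b, c makes a left turn
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]) > 0

def graham_scan(points_by_angle):
    stack = []
    for p in points_by_angle:
        # restore the invariant: the two topmost points and p form a left turn
        while len(stack) >= 2 and not left_turn(stack[-2], stack[-1], p):
            stack.pop()
        stack.append(p)
    return stack  # the points on the convex hull

Replacing the list by the external stack and the sort by external mergesort yields exactly the I/O bounds quoted above, since each point is pushed and popped at most once.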
In general, O(N ) points of S can lie on the convex hull, but there are
situations where the number Z of points on the convex hull is (asymptotically)
much smaller. An output-sensitive algorithm for computing the convex hull
in two dimensions has been obtained by Goodrich et al. [345]. Building upon
the concept of marriage-before-conquest [458], the authors combine external
versions of finding the median of an unsorted set [17] and of computing the
convex hull of a partially sorted point set [343] to obtain an optimal output-
sensitive external algorithm with complexity O((N/B) logM/B (Z/B)).
Independent from this particular problem, Hoel and Samet [402] claimed
that accessing disjoint decompositions of data space tends to be faster than
other decompositions for a wide range of hierarchical spatial index struc-
tures. Along these lines, Böhm and Kriegel [138] presented two algorithms
for solving the Convex Hull problem using spatial index structures. One al-
gorithm, computing the minimum and maximum values for each dimension
and traversing the index depth-first, is shown to be optimal in the number of
disk accesses as it reads only the pages containing points not enclosed by the
convex hull, and each such page only once. The second algorithm performs worse in terms of I/O but
needs less CPU time. It is unclear, however, how to extend these algorithms
to higher dimensions.
An approach to the d-dimensional Convex Hull problem is based on the
observation that the convex hull of S ⊂ IRd can be inferred from the inter-
section of halfspaces in the dual space (IRd )∗ [781] (see also Figures 6.4(a)
and (b)). For each point p ∈ S, the corresponding dual halfspace is given by
p^* := {x ∈ (IR^d)^* | Σ_{i=1}^{d} x_i p_i ≤ 1}. At least for d ∈ {2, 3}, the intersection of
halfspaces can be computed I/O-efficiently (see the following Problem 6.2),
and this results in corresponding I/O-efficient algorithms for the Convex Hull
problem in these dimensions.
Problem 6.2 (Halfspace Intersection). Given a set S of N halfspaces in
IRd , compute the common intersection of these halfspaces.
In the context of the Halfspace Intersection problem, efficient external
algorithms are known only for the situation d ≤ 3. The intersection in three
dimensions can be computed by either using an externalization of Reif and
Sen’s parallel algorithm [630] (as proposed by Goodrich et al. [345]) or by an
algorithm that can be derived in the framework of randomized incremental
construction with gradations [228] (see Section 6.4). Both algorithms require
O((N/B) logM/B (N/B)) I/Os (for the first approach, this bound holds with
high probability, while it is the expected complexity for the second approach).
The problem we discuss next has already been mentioned in the context
of solving problems by reduction (Section 6.2.1):
Problem 6.3 (Closest Pair). Given a set S of N points in IRd and a
distance metric d, find a pair (p, q) ∈ S × S, p ≠ q, for which d(p, q) =
min{d(r, s) | r, s ∈ S, r ≠ s} (see Figure 6.5(a)).
There is a variety of optimal algorithms in the internal memory setting
that solve the problem either directly or by exploiting reductions to other
problems (see also the survey by Smid [700]).
(a) Closest pair. (b) Nearest neighbor for query point p.
Fig. 6.5. Closest-point problems.
In the external memory set-
ting, the (static) problem of finding the closest pair in a fixed set S of N
points can be solved by exploiting the reduction to the All Nearest Neighbors
problem (Problem 6.6), where for each point p ∈ S, we wish to determine
its nearest neighbor in S \ {p} (see Problem 6.5). Having computed this list
of N pairs of points, we can easily select two points forming a closest pair
by scanning the list while keeping track of the closest pair seen so far. As
we will discuss below, the complexity of solving the All Nearest Neighbors problem
is O((N/B) logM/B (N/B)), which gives us an optimal algorithm for solving
the static Closest Pair problem.
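The final selection step is a single scan; in the sketch below (illustrative Python), all_nn denotes the precomputed list of triples (p, q, d(p, q)):

def closest_pair(all_nn):
    # one scan over the N triples keeps track of the closest pair seen so
    # far; stored blockwise on disk, this step costs O(N/B) I/Os
    p, q, _ = min(all_nn, key=lambda triple: triple[2])
    return p, q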
Handling the dynamic case is considerably more involved, as an insertion
or a deletion could change a large number of “nearest neighbors”, and con-
sequently, the reduction to the All Nearest Neighbors problem would require
touching at least the same number of objects.
Callahan, Goodrich, and Ramaiyer [168] introduced an external variant
of topology trees [316], and building upon this data structure, they managed
to develop an external version of the dynamic closest pair algorithm by Be-
spamyatnikh [121]. The data structure presented by Callahan et al. can be
used to dynamically maintain the closest pair spending O(logB N ) I/Os per
update.
The Closest Pair problem can also be considered in a bichromatic setting,
where each point is labeled with either of two colors, and where we wish to
report a pair with minimal distance among all pairs of points having different
colors [10, 351]. This problem can be generalized to the case of reporting the
K bichromatic closest pairs.
Problem 6.4 (K-Bichromatic Closest Pairs). Given a set S of N points
in IRd with S = S1 ∪ S2 and S1 ∩ S2 = ∅, find the K closest pairs (p, q) ∈ S1 × S2.
Some efficient internal memory algorithms for solving this problem have
been proposed [10, 451], but it seems that none of them can be externalized
efficiently. In the context of spatial databases, the K-Bichromatic Closest
Pairs problem can be seen as a special instance of a so-called θ-join which
is defined as follows: Given two sets S1 and S2 of objects and a predicate θ :
S1 × S2 → IB, compute all pairs (s1 , s2 ) ∈ S1 × S2 , for which θ(s1 , s2 ) = true.
In his approach to the K-Bichromatic Closest Pairs problem, Henrich [393]
considered the special case |S2| = 1, and assuming that S1 is indexed
hierarchically, he proposed to perform a priority-driven traversal of the spatial
index structure storing S1 . Hjaltason and Samet [401] later generalized this
approach and referred to Problem 6.4 as a special instance of a θ-join, namely
the incremental distance join (again assuming that each relation is indexed hi-
erarchically). Their algorithm schedules a priority-driven synchronous traver-
sal of both trees, repeatedly looking at two nodes, one from each tree. The
processing is guided by the distance between the (routing) rectangles corre-
sponding to the nodes, and to each pair this distance is assigned as the pair’s
priority. Initially, the priority queue contains all pairs that can be formed by
grouping the root of one tree and the children of the root of the other tree,
and the first element in the queue always forms the closest pair of objects
stored in the queue. For each removed pair of nodes, the pairs formed by
the children (if any) are inserted into the queue. Whenever a pair of data
objects appears at the front of the queue, its associated distance is minimal
among all unconsidered distances, hence, all K bichromatic closest pairs can
be reported ordered by increasing distance. This algorithm benefits from the
observation that in practical applications K ≪ |S1 × S2|, but nevertheless, the
priority queue might contain a large number of pairs. Hjaltason and Samet
described several approaches for how to organize the priority queue such that
only a small portion of it actually resides in main memory. This means that
only the promising candidate pairs are kept in main memory whereas all pairs
having a large distance are off-loaded to external memory. The authors argue
that except for unlikely worst-case configurations, their approaches perform
without accessing the off-loaded data and that worst-case configurations can
be handled gracefully as well. Worst-case optimal external priority queues
are also discussed in Chapter 2 and Chapter 3.
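The traversal can be sketched with an in-memory heap as follows (illustrative Python; the node layout with is_data, rect, and children is an assumption of this sketch, the queue is seeded with the root pair rather than with root-children pairs for brevity, and the external organization of the queue is omitted):

import heapq, itertools

def mindist(r, s):
    # smallest possible distance between two axis-parallel rectangles
    dx = max(r[0] - s[2], s[0] - r[2], 0.0)
    dy = max(r[1] - s[3], s[1] - r[3], 0.0)
    return (dx * dx + dy * dy) ** 0.5

def incremental_distance_join(root1, root2, K):
    tie = itertools.count()      # breaks ties among equal distances
    pq = [(mindist(root1.rect, root2.rect), next(tie), root1, root2)]
    reported = 0
    while pq and reported < K:
        d, _, u, v = heapq.heappop(pq)
        if u.is_data and v.is_data:
            yield u, v, d        # pairs emerge by increasing distance
            reported += 1
        else:
            # expand one non-data node; since a child's routing rectangle is
            # contained in its parent's, distances in the queue never decrease
            a, b = (u, v) if not u.is_data else (v, u)
            for child in a.children:
                heapq.heappush(pq, (mindist(child.rect, b.rect), next(tie), child, b))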
Corral et al. [219] presented a collection of algorithms that improve the
effective running time of the above algorithms for solving the Bichromatic
Closest Pair problem. These improvements include a separate treatment for
the case K = 1 and choosing a heap-based priority queue.
These algorithms for solving Problem 6.4 can be modified to solve (the
monochromatic) Problem 6.3. This modification is not generally possible [219]
for an arbitrary algorithm solving Problem 6.4. A description of modifications
for the latter algorithm has been given by Corral et al. [218]. The authors
claim that these modifications do not seriously affect the performance of their
algorithm.
As mentioned before, spatial index structures try to cluster objects based
upon their spatial location, and consequently, several approaches have been
made to exploit this structural property when dealing with proximity prob-
lems. A fundamental proximity problem is to organize a set of points such
that for each query point the point closest to it can be reported quickly.
Problem 6.5 (Nearest Neighbor). Given a set S of N points in IRd , a
distance metric d, and a query point p in IRd , report a point q ∈ S, for which
d(p, q) = min{d(p, r) | r ∈ S} (see Figure 6.5(b)).
Problem 6.5 and its relatives occur in a variety of conventional geographi-
cal applications, e.g., when searching for the closest geometric feature of some
kind relative to some given spatial location. This problem is also referred to
as the Post Office problem [460]3. Since the metric d defining the “closeness”
of two objects is also a parameter in the problem setting, this problem can
be found in new application areas like multimedia database systems. In this
setting, multimedia objects, e.g., text, image, or video objects, are described
by high-dimensional feature vectors which in turn are considered as points in
the feature space. Proximity among these feature vectors implies similarity
between the objects represented, and in combination with carefully chosen
metrics, spatial index structures can be used for efficiently performing simi-
larity search [137, 473, 668, 681].
The Nearest Neighbor problem can also be restated in the context of
Voronoi diagrams (see Problem 6.11), and using techniques by Goodrich
et al. [345], one can obtain a static data structure that answers nearest neigh-
bor queries in O(logB N ) I/Os. We will comment on this approach when
discussing algorithms for computing the Voronoi diagram.
(a) Approximate nearest neighbor. (b) All nearest neighbors.
Fig. 6.6. Nearest-neighbor problems.
A variant of the Nearest Neighbor problem is to compute an approximate
nearest neighbor for a given query point. Here, an additional parameter ε
is used to allow for certain slack in the reported “minimum” distance. For
ε > 0, a (1 + ε)-approximate nearest neighbor of a query point p is a point q
whose distance to p is no more than (1 + ε) times the distance from p to its
actual nearest neighbor (see Figure 6.6(a)).
Using external topology trees, Callahan et al. [168] derived an external
version of the data structure by Arya et al. [72] that can be used to maintain
S under insertions and deletions with O(logB N ) I/Os per update such that
an approximate nearest neighbor query can be answered spending O(logB N )
I/Os.4 Even in the internal memory setting, it is an open problem to find
an efficient dynamic data structure with O(N log^{O(1)} N) space that can be
used for the exact Nearest Neighbor problem and has O(log^{O(1)} N) update
and query time [700], and not surprisingly, the external memory variant of
this problem is unsolved as well.
3 This reference is ascribed to Knuth as he discusses a data structure called post-office tree which can be used for answering a query of the kind “What is the nearest city to point x?”, given the value of x [460, page 563].
4 The constants hidden in the “Big-Oh”-notation depend on d and ε.
Berchtold et al. [112] proposed to use hierarchical spatial index struc-
tures to store the data points. They also introduced a different cost model
and compared the predicted and actual cost of solving the Nearest Neighbor
problem for real-world data using an X-tree [114] and a Hilbert-R-tree [287].
Brin [148] introduced the GNAT index structure which resembles a hierarchi-
cal Voronoi diagram (see Problem 6.11). He also gave empirical evidence that
this structure outperforms most other index structures for high-dimensional
data spaces.
The practical relevance of the nearest neighbor, however, becomes less
significant as the number of dimensions increases. For both real-world and
synthetic data sets in high-dimensional space (d > 10), Weber, Schek, and
Blott [759] as well as Beyer et al. [123] showed that under several distance
metrics the distance to the nearest neighbor is larger than the distance be-
tween the nearest neighbor and the farthest neighbor of the query point.
Their observation raises an additional quality issue: The exact nearest neigh-
bor of a query point might not be relevant at all. As an approach to cope with
this complication, Hinneburg, Aggarwal, and Keim [397] modified the Nearest
Neighbor problem by introducing the notion of important dimensions. They
introduced a quality criterion to determine which dimensions are relevant to
the specific proximity problem in question and examined the data distribu-
tion resulting from projections of the data set to these dimensions. Obviously,
their approach yields improvements over standard techniques only if the num-
ber of “important” dimensions is significantly smaller than the dimension of
the data space.
Problem 6.6 (All Nearest Neighbors). Given a set S of N points in IRd
and a distance metric d, report for each point p ∈ S a point q ∈ S, for which
d(p, q) = min{d(p, r) | r ∈ S, p ≠ r} (see Figure 6.6(b)).
The All Nearest Neighbors problem, which can also be seen as a special
batched variant of the (single-shot) Nearest Neighbor problem, can be posed,
e.g., in order to find clusters within a point set. Goodrich et al. [345] pro-
posed an algorithm with O((N/B) logM/B (N/B)) I/O-complexity based on
the distribution sweeping paradigm: Their approach is to externalize a par-
allel algorithm by Atallah and Tsay [74] replacing work on each processor by
work within a single memory load. Recall that on each level of distribution
sweeping, only interactions between strips are handled, and that interactions
within a strip are handled recursively. In the situation of finding nearest
neighbors, the algorithm performs a top-down sweep keeping track of each
point whose nearest neighbor above does not lie within the same strip. The
crucial observation by Atallah and Tsay is that there are at most four such
points in each strip, and by choosing the branching factor of distribution
sweeping as M/(5B), the (at most) four blocks per strip containing these
points as well as the M/(5B) blocks needed to produce the input for the
recursive steps can be kept in main memory. Nearest neighbors within the
same strip are found recursively, and the result is combined with the result
of a second bottom-up sweep to produce the final answer.
In several applications, it is desirable to compute not only the exact nearest
neighbors but to additionally compute for each point the K points closest
to it. An algorithm for this so-called All K-Nearest Neighbors problem
has been presented by Govindarajan et al. [346]. Their approach (which
works for an arbitrary number d of dimensions) builds upon an exter-
nal data structure to efficiently maintain a well-separated pair decomposi-
tion [169]. A well-separated pair decomposition a set S of points is a hi-
erarchical clustering of S such that any two clusters on the same level of
the hierarchy are farther apart than any to points within the same clus-
ter, and several internal memory algorithms have been developed building
upon properties of such a decomposition. The external data structure of
Govindarajan et al. occupies O(KN/B) disk blocks and can be used to
compute all K-nearest neighbors in O((KN/B) logM/B (KN/B)) I/Os. Their
method can also be used to compute the K closest pairs in d dimensions in
O(((N + K)/B) logM/B ((N + K)/B)) I/Os using O((N + K)/B) disk blocks.
(a) Reverse nearest neighbors for point p. (b) K-nearest neighbors via lifting.
Fig. 6.7. Non-standard nearest-neighbor problems.
Problem 6.7 (Reverse Nearest Neighbors). Given a set S of N points
in IRd , a distance metric d, and a query point p in IRd , report all points q ∈ S,
for which d(q, p) = min{d(q, r) | r ∈ (S ∪ {p}) \ {q}} (see Figure 6.7(a)).
The Reverse Nearest Neighbors problem has been introduced in the spatial
database setting by Korn and Muthukrishnan [472] who also presented static
and dynamic solutions for the bichromatic and monochromatic problems. For
simplicity, we only discuss the solution to the static monochromatic problem
here, as for their approach only minor modifications are needed to solve
the other three problems. In a preprocessing step, the All Nearest Neighbors
problem is solved for S. Each point q and its nearest neighbor r define a ball
centered at q with radius d(q, r). All N such balls are stored in a spatial
index structure that can be used to report, given a query point p, all balls
containing p. It is easy to verify that the points corresponding to the balls that
contain p are exactly the points having p as their nearest neighbor in S ∪ {p}.
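A brute-force rendering of this two-phase approach reads as follows (illustrative Python; the quadratic loop stands in for an I/O-efficient solution to the All Nearest Neighbors problem, and the plain list of balls stands in for the spatial index structure):

def preprocess_balls(S, dist):
    # one ball per point q, centered at q with radius d(q, nearest neighbor of q)
    return [(q, min(dist(q, s) for s in S if s is not q)) for q in S]

def reverse_nearest_neighbors(balls, p, dist):
    # the points whose ball contains p are exactly the points having p as
    # their nearest neighbor in S ∪ {p}
    return [q for (q, radius) in balls if dist(q, p) <= radius]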
In the internal memory setting, at least the static version of the Reverse
Nearest Neighbor problem can be solved efficiently [524]. The main problem
when trying to efficiently solve the problem in a dynamic setting is that
updating S essentially involves finding nearest neighbors in a dynamically
changing point set, and—as discussed in the context of Problem 6.5—no
efficient solution with at most polylogarithmic space overhead is known.
Problem 6.8 (K-Nearest Neighbors). Given a set S of N points in IRd ,
an integer K with 1 ≤ K ≤ N , and a query point p in IRd , report K points
qi ∈ S closest to p.
Agarwal et al. [6] solved the two-dimensional K-Nearest Neighbors prob-
lem in the dual setting: using a duality transform, they proposed to map each
two-dimensional point (a_1, a_2) to the hyperplane z = a_1^2 + a_2^2 − 2a_1x − 2a_2y
which is tangent to the unit parabola at the (lifted) point (a_1, a_2, a_1^2 + a_2^2).
In this setting, the problem of finding the K nearest neighbors for a point
p = (xp , yp ) can be restated as finding the K highest hyperplanes above the
point (x_p, y_p, 0). (For the sake of simplicity, the corresponding one-dimensional
problem is sketched in Figure 6.7(b); consider, e.g., point O: the two highest
hyperplanes lying above O are defined by lifting points B and C, which are
also the two nearest neighbors of O.) Using an external version of Chan’s al-
gorithm for computing (≤ k)-levels of an arrangement [175], Agarwal et al. [6]
developed a data structure for range searching among halfplanes that, after
spending O((N/B) log2 N logB N ) expected I/Os for preprocessing, occupies
an expected number of O((N/B) log2 (N/B)) disk blocks. This data struc-
ture can be used to report the K highest halfplanes above a query point,
and by duality, the K nearest neighbors in the original setting, spending
O(logB N + K/B) expected I/Os per query.
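The correspondence can be checked numerically: writing the tangent plane as z = 2a_1x + 2a_2y − (a_1^2 + a_2^2) — sign conventions for the lifted hyperplane differ between presentations — its height at a query point p equals |p|^2 − |p − a|^2, so ranking planes by height at p ranks the original points by proximity (illustrative Python):

def plane_height(a, p):
    # height at p of the hyperplane tangent to z = x^2 + y^2 at the lifted
    # point (a1, a2, a1^2 + a2^2); algebraically equals |p|^2 - |p - a|^2
    return 2 * a[0] * p[0] + 2 * a[1] * p[1] - (a[0] ** 2 + a[1] ** 2)

def k_nearest_via_lifting(S, p, K):
    # the K highest planes above (p, 0) belong to the K nearest neighbors
    return sorted(S, key=lambda a: plane_height(a, p), reverse=True)[:K]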
In addition to the quite involved data structure mentioned above, spatial
index structures have been considered to solve the K-Nearest Neighbor prob-
lem [190, 473, 640, 681]. Much attention has been paid to pruning parts of
the candidate set [681] and to removing inefficient heuristics [190]. As men-
tioned above, the performance of most index structures degrades for high
dimensions, and even though the Pyramid-Technique [113] can be used for
uniformly distributed data in high dimensions, its performance degrades for
non-uniformly distributed data. To overcome this deficiency, Yu et al. [774]
presented a new approach called iDistance which is adaptable with respect
to data distribution. They propose to partition the data space according to
its characteristics and, for each partition, to index the distance between con-
tained data points and a reference point using a B+-tree. Their algorithm
can be used to incrementally refine approximate answers such that early dur-
ing the algorithm, approximate results can be output if desired. In contrast,
the VA-file of Weber et al. [758, 759] uses approximated data to produce a
set of candidate pairs during nearest neighbor search. It partitions the data
space into cells and stores unique bit strings for these cells in an (option-
ally compressed) array. During a sequential scan of this array, candidates are
determined by using the stored approximations, before these candidates are
further examined to obtain the final result.
Establishing a trade-off between used disk space and obtained query time,
Goldstein and Ramakrishnan [338] presented an approach to reduce query
time by examining some characteristics of the data and storing redundant
information. Following their approach the user can explicitly relate query
performance and disk space, i.e., more redundant information can be stored
to improve query performance and vice versa. At the price of a small percentage of
only approximately correct answers in the final result, this approach leads to
sub-linear query processing in high dimensions.
The description of algorithms for the K-Nearest Neighbors problem con-
cludes our discussion of proximity problems, that is, of selecting certain points
according to their proximity to one or more query points. The next two prob-
lems also consist of selecting a subset of the original data, namely the set
contained in a given query range. These problems, however, have been dis-
cussed in detail by recent surveys [11, 56, 754], so we only sketch the main
results in this area.
(a) Halfspace range searching. (b) Orthogonal range searching.
Fig. 6.8. Range searching problems.
Problem 6.9 (Halfspace Range Searching). Given a set S of N points
in IR^d and a vector a ∈ IR^d, report all Z points x ∈ S for which
x_d ≤ a_d + Σ_{i=1}^{d-1} a_i x_i.
The main source for solutions to the halfspace range searching problem
in the external memory setting is the paper by Agarwal et al. [6]. The au-
thors presented a variety of data structures that can be used for halfspace
range searching, classifying their solutions into linear and non-linear space data
structures. All proposed algorithms rely on the following duality transform
and the fact that it preserves the “above-below” relation.
D: G^d → IR^d : x_d = a_d + Σ_{i=1}^{d-1} a_i x_i ↦ (a_1, . . . , a_d)
   IR^d → G^d : (b_1, . . . , b_d) ↦ x_d = b_d − Σ_{i=1}^{d-1} b_i x_i
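Written out in code, the transform reads as follows (illustrative Python; a hyperplane x_d = a_d + Σ_{i<d} a_i x_i is represented by its coefficient vector (a_1, . . . , a_d), and the function names are ours):

def hyperplane_to_point(a):
    # D maps the hyperplane x_d = a_d + sum_{i<d} a_i*x_i to the point (a_1, ..., a_d)
    return tuple(a)

def point_to_hyperplane(b):
    # D maps the point (b_1, ..., b_d) to the hyperplane
    # x_d = b_d - sum_{i<d} b_i*x_i, i.e., to the coefficients (-b_1, ..., -b_{d-1}, b_d)
    return tuple(-x for x in b[:-1]) + (b[-1],)

def above(p, a):
    # True if the point p lies on or above the hyperplane with coefficients a
    return p[-1] >= a[-1] + sum(ai * pi for ai, pi in zip(a[:-1], p[:-1]))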
In the linear space setting, the general problem for d > 3 can be solved
using an external version of a partition tree [535] spending, for any fixed ε > 0,
O((N/B)^{1−1/d+ε} + Z/B) I/Os per query. The expected preprocessing com-
plexity is O(N log2 N) I/Os. For simplex range searching queries, that is, for
reporting all points in S lying inside a given query simplex with µ faces of all
dimensions, O((µN/B)^{1−1/d+ε} + Z/B) I/Os are sufficient. For halfspace range
searching and d = 2, the query cost can be reduced to O(logB N +Z/B) I/Os
(using O(N log2 N logB N ) expected I/Os to preprocess an external version of
a data structure by Chazelle, Guibas, and Lee [184]). Using partial rebuild-
ing, points can also be inserted into/removed from S spending amortized
O(log2 (N/B) logB N ) I/Os per update.
If one is willing to spend slightly super-linear space, the query cost in the
three-dimensional setting can be reduced to O(logB N +Z/B) I/Os at the ex-
pense of an expected overall space requirement of O((N/B) log2 (N/B)) disk
blocks. This data structure externalizes a result of Chan [175] and can be con-
structed spending an expected number of O((N/B) log2 (N/B) logB N ) I/Os.
Alternatively, Agarwal et al. [6] propose to use external versions of shallow
partition trees [536] that use O((N/B) logB N ) space and can answer a query
spending O((N/B)^ε + Z/B) I/Os. This approach can also be generalized to an
arbitrary number d of dimensions: a halfspace range searching query can be
answered spending O((N/B)^{1−1/⌊d/2⌋+ε} + Z/B) I/Os. The exact complex-
ity of halfspace range searching is unknown—even in the well-investigated
internal memory setting, there exist several machine model/query type com-
binations where no matching upper and lower bounds are known [11].
Problem 6.10 (Orthogonal Range Searching). Given a set S of N
points in IR^d and d (possibly unbounded) intervals [l_i, r_i], report all Z points
x ∈ S for which x ∈ [l_1, r_1] × · · · × [l_d, r_d].
The more restricted Orthogonal Range Searching problem can obviously
be solved by storing the data points (considered as infinitesimally small rect-
angles) in a spatial index structure and by performing a range query (window
query). The actual query time, however, depends on the heuristic for clus-
tering nodes, and in the worst case, the index structure has to be traversed
completely—even if Z ∈ O(1). Despite this disadvantage, most of these in-
dex structures occupy only linear space and support updates I/O-efficiently.
Occupying only linear space has been recognized as a conceptual advantage
that may cancel the disadvantage of a theoretically high query cost, and the
notion of indexability has been introduced to investigate possible trade-offs
between storage redundancy and access overhead in the context of range
searching [388].
An external data structure that uses linear space and efficiently supports
both updates and queries has been proposed by Grossi and Italiano [360].
The authors externalized their internal memory cross-tree, which can be seen
as a cross-product of d one-dimensional index structures, and obtained a data
structure that can be updated in O(logB N) I/Os per update and answers orthogonal
range queries in O((N/B)^{1−1/d} + Z/B) I/Os per query. The external cross-
tree can be built in O((N/B) logM/B (N/B)) I/Os. In a different model that
excludes threaded data structures like the cross-tree, Kanth and Singh [444]
obtained similar bounds (but with amortized update complexity) by layering
B-trees and k-D-trees. Their paper additionally includes a proof of a matching
lower bound.
The Orthogonal Range Searching problem has also been considered in
the batched setting: Arge et al. [65] and Goodrich et al. [345] showed how
to solve the two-dimensional problem spending O((N/B) logM/B (N/B) +
Z/B) I/Os using linear space. Arge et al. [65] extended this result to higher
dimensions and obtained a complexity of O((N/B) log^{d−1}_{M/B}(N/B) + Z/B)
I/Os. The one-dimensional batched dynamic problem, i.e., the problem where all Q updates are
known in advance, can be solved in O(((N + Q)/B) logM/B((N + Q)/B) + Z/B)
I/Os [65], but no corresponding bound is known in higher dimensions.
Problems that are slightly less general than the Orthogonal Range Search-
ing problem are the (two-dimensional) Three-Sided Orthogonal Range Search-
ing and Two-Sided Orthogonal Range Searching problem, where the query
range is unbounded at one or two sides. Both problems have been consid-
ered by several authors [129, 421, 443, 624, 709, 750], most recently by Arge,
Samoladas, and Vitter [67] in the context of indexability [388]—see also more
specific surveys [11, 56, 754].
Another recent development in the area of range searching concerns algorithms
for range searching among moving objects. In this setting, each object is
assigned a (static) “flight plan” that determines how the position of an object
changes as a (continuous) function of time. Using external versions of partition
trees [535], Agarwal, Arge, and Erickson [5] and Kollios and Tsotras [463]
developed efficient data structures that can be used to answer orthogonal
range queries in one and two dimensions spending O((N/B)^{1/2+ε} + Z/B)
I/Os. These solutions are time-oblivious in the sense that the complexity of
a range query does not depend on how far the point of time of the query
is in the future. Time-responsive solutions that answer queries in the near
future (or past) faster than queries further away in time have been proposed
by Agarwal et al. [5] and by Agarwal, Arge, and Vahrenhold [8].
We conclude this section by discussing the Voronoi diagram and its graph-
theoretic dual, the Delaunay triangulation. Both structures have a variety of
proximity-related applications, e.g., in Geographic Information Systems, and
we refer the interested reader to more specific treatments of how to work with
these structures [76, 275, 336].
Problem 6.11 (Voronoi Diagram). Given a set S of N points in IRd
and a distance metric d, compute for each point p ∈ S its Voronoi region
V(p, S) := {x ∈ IR^d | d(x, p) ≤ d(x, q) for all q ∈ S \ {p}}.
Given the above definition, the Voronoi diagram consists of the union of all
N Voronoi regions which are disjoint except for a possibly shared boundary.
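The defining predicate translates directly into a brute-force membership test (illustrative Python), which serves as the semantic reference point for the constructions discussed below:

def in_voronoi_region(x, p, S, dist):
    # x lies in V(p, S) iff no point of S is strictly closer to x than p
    return all(dist(x, p) <= dist(x, q) for q in S if q is not p)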
(a) Voronoi diagram via lifting. (b) Delaunay triangulation via lifting.
Fig. 6.9. Computing the Voronoi diagram and the Delaunay triangulation.
An optimal algorithm for computing the Voronoi diagram can be obtained by
a transformation already used for solving the K-Nearest Neighbors problem
(Problem 6.8). The key idea is that a Voronoi region for a point p ∈ S contains
exactly those points in IRd that have p as their nearest neighbor with respect
to S. To compute the Voronoi diagram in d dimensions, each point is lifted to
the (d + 1)-dimensional unit parabola, and the intersection of the halfspaces
dual to these points is computed (see Figure 6.9(a)). As already mentioned
in the discussion of the K-Nearest Neighbors problem (Problem 6.8), the
highest plane above a d-dimensional point is dual to the lifted version of its
nearest neighbor [275], and consequently, the projection of the intersection
of halfspaces back to d-dimensional space results in the Voronoi diagram. As
the intersection of halfspaces can be computed efficiently in two and three
dimensions (see Problem 6.2), the Voronoi diagram in one and two dimensions
can be constructed using the above transformation. It should be noted that
a similar transformation, namely computing the convex hull (Problem 6.1)
of the lifted points (see Figure 6.9(b)) can be used to compute the graph-
theoretic dual of the Voronoi diagram, the Delaunay triangulation, in two
and three dimensions.
The Voronoi diagram can be used to solve the (static) Nearest Neighbor
problem (Problem 6.5). This is due to the observation that each query point q
that does not lie on a shared boundary of Voronoi regions falls into exactly one
Voronoi region, say the region belonging to some point p ∈ S. By definition,
this region contains all points in the plane that are closer to p than to any
other point of S, that is, all points for which p is the nearest neighbor with
respect to S. In order to find the region containing the query point q, one has
to solve the Point Location problem. An algorithm for solving this problem—
formally defined as Problem 6.18 in Section 6.4—can be used to answer a
Nearest Neighbor query for a static set S in O(logB N ) I/Os.
In the internal memory setting, a variety of two-dimensional problems can
be solved by using either the Voronoi diagram or the Delaunay triangulation.
Almost all these solutions require one of these structures to be traversed, and
as both structures are planar graphs, we refer to Chapter 5 for details on the
external memory complexity of such traversals.
6.4 Problems Involving Sets of Line Segments
We begin this section by stating a geometric problem that is inherently one-
dimensional even though it is formulated in a two-dimensional setting. This
problem serves also as a vehicle for introducing the interval tree data struc-
ture. The external memory version of this data structure is a building block
for several efficient algorithms and its description can also be used to demon-
strate design techniques for externalizing data structures.
(a) Stabbing a set of segments. (b) Stabbing a set of intervals.
Fig. 6.10. Reducing the Segment Stabbing problem to a one-dimensional setting.
Problem 6.12 (Segment Stabbing). Given a set S of N segments in the
plane and a vertical line ℓ : x = x_ℓ, compute all Z segments in S intersected by ℓ
(see Figure 6.10(a)).
The key observation leading towards an optimal algorithm is that the seg-
ments stabbed by ℓ are exactly those segments in S whose projections onto
the x-axis contain the point (x_ℓ, 0) (see Figure 6.10(b)). In the internal mem-
ory setting, this reduced problem can be solved optimally, that is spending
O(log2 N +Z) time and linear space, by using the so-called interval tree [274].
An interval tree is a perfectly balanced binary search tree over the set of x-
coordinates of all endpoints in S (hereafter referred to as “x-coordinates in
S”), and data elements are stored in internal nodes as well as in leaf nodes.
Each node corresponds to the median of (interval of) all x-coordinates in S
stored in the subtree rooted at that node, e.g., the root corresponds to the
median of all x-coordinates in S. The x-coordinate stored at an internal node
v naturally partitions the set stored in the corresponding subtree into two
slabs, and a segment in S is stored in a secondary data structure associated
with v, if and only if it crosses the boundary between these slabs and does
not cross any slab boundary induced by v’s parent. The interval tree stor-
ing S can be updated (that is insertions and deletions can be performed) in
O(log2 N) time per update.5
5 The insertion bound is amortized if the set of x-coordinates in S is augmented due to this insertion.
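Before turning to the externalization, the classic internal memory query can be sketched as follows (illustrative Python; each node is assumed to store its key, children left and right, and the intervals crossing the key twice, in a list by_left sorted by ascending left endpoint and a list by_right sorted by descending right endpoint):

def stab(node, x, out):
    # query costs O(log2 N + Z): each node on the search path contributes
    # its stored intervals containing x, in time O(1) per reported interval
    while node is not None:
        if x < node.key:
            for (l, r) in node.by_left:     # ascending left endpoints
                if l > x:
                    break
                out.append((l, r))          # l <= x < key <= r
            node = node.left
        else:
            for (l, r) in node.by_right:    # descending right endpoints
                if r < x:
                    break
                out.append((l, r))          # l <= key <= x <= r
            node = node.right
    return out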
Arge and Vitter [71] obtained an optimal external memory solution for the
Segment Stabbing problem by developing an external version of the interval
tree. Their data structure occupies linear space and can be used to answer
stabbing queries spending O(logB N +Z/B) I/Os per query. As in the internal
setting, the data structure can be made dynamic, and the resulting dynamic
data structure supports both insertions and deletions with O(logB N ) worst-
case I/O-complexity.
The externalization technique used by Arge and Vitter is of independent
interest, hence, we will present it in a little more detail. In order to obtain a
query complexity of O(logB N +Z/B) I/Os, the fan-out of the base tree has to
be in O(B c ) for some constant c > 0, and for reasons that will become clear
immediately, this constant is chosen as c = 1/2. As mentioned above, the
boundaries between the children of a node v are stored at v and partition the
interval associated with v into consecutive slabs, and a segment s intersecting
the boundary of such a slab (but of no slab corresponding to a child of v’s
parent) is stored at v. The slabs intersected by s form a contiguous subinterval
[s_l, s_r] of [s_1, s_{√B}]. In the situation of Figure 6.11(a), for example, the segment
s intersects the slabs s1 , s2 , s3 , and s4 , hence, l = 1 and r = 4. The indices l
and r induce a partition of s into three (possibly empty) subsegments: a left
subsegment s ∩ s_l, a middle subsegment s ∩ [s_{l+1}, s_{r−1}], and a right subsegment
s ∩ s_r.
Each of the √B slabs associated with a node v has a left and right struc-
ture that stores left and right subsegments falling into the slab. In the situa-
tion of the interval tree, these structures are lists ordered by the x-coordinates
of the endpoints that do not lie on the slab boundary. Handling of middle
subsegments is complicated by the fact that a subsegment might span more
than one slab, and storing the segment at each such slab would increase both
space requirement and update time. To resolve this problem, Arge and Vitter
introduced the notion of multislabs: a multislab is a contiguous subinterval
of [s_1, s_{√B}], and it is easy to realize that there are Θ(√B · √B) = Θ(B) such
multislabs. Each middle subsegment is stored in a secondary data structure
corresponding to the (unique) maximal multislab it spans, and as there are
only Θ(B) multislabs, the node v can accommodate pointers to all these
structures in O(1) disk blocks.6
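The bookkeeping for a single node can be sketched as follows (illustrative Python; boundaries holds the x-coordinates of the node's slab boundaries in sorted order, slabs are numbered from 0, and the segment is assumed to lie within the node's interval):

import bisect

def classify(segment, boundaries):
    # returns the slab of the left subsegment, the maximal multislab of the
    # middle subsegment (or None if it is empty), and the slab of the right
    # subsegment; with sqrt(B) slabs there are Theta(B) possible multislabs
    xl, xr = segment
    l = bisect.bisect_right(boundaries, xl) - 1
    r = bisect.bisect_right(boundaries, xr) - 1
    middle = (l + 1, r - 1) if l + 1 <= r - 1 else None
    return l, middle, r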
As in the internal memory setting, a stabbing query with ℓ : x = x_ℓ is answered
by performing a search for x_ℓ and querying all secondary structures of the
nodes visited along the path. As the tree is of height O(logB N), and as
each left and right structure that contributes Z′ ≥ 0 elements to the answer
set can be queried in O(1 + Z′/B) I/Os, the overall query complexity is
O(logB N + Z/B) I/Os.7
6 To ensure that the overall space requirement is O(N/B) disk blocks, multislab lists containing too few segments are grouped together into a special underflow structure [71].
7 Note that each multislab structure queried contributes all its elements to the answer set, hence, the complexity of querying O(√B logB N) multislab structures is O(Z/B).
The main problem with making the interval tree dynamic is that the in-
sertion of a new interval might augment the set of x-coordinates in S. As a
consequence, the base tree structure of the interval tree has to be reorganized,
and this in turn might require several segments moved between secondary
structures of different nodes. Using weight-balanced B-trees (see Chapter 2)
and a variant of the global rebuilding technique [599], Arge and Vitter ob-
tained a linear-space dynamic version of the interval tree that answers stab-
bing queries in O(logB N + Z/B) I/Os and can be updated in O(logB N )
I/Os worst-case.
(a) A node in an external interval tree. (b) A diagonal corner query.
Fig. 6.11. Different approaches to the Segment Stabbing problem.
A completely different approach to solving the Segment Stabbing problem
is to regard this problem as a special case of two-sided range searching in two
dimensions, namely as a so-called diagonal corner query. By regarding the
(one-dimensional) interval [x_l, x_r] as the two-dimensional point (x_l, x_r) lying
above the main diagonal, a stabbing query for the vertical line ℓ : x = x_ℓ corre-
sponds to a two-sided range query with apex at (x_ℓ, x_ℓ) (see Figure 6.11(b)).
As diagonal corner queries can be answered by any data structure proposed
for two-dimensional (two-sided, three-sided, or general orthogonal) range
searching, all solutions discussed for the Orthogonal Range Searching problem
(Problem 6.10) can be applied to the Segment Stabbing problem.
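In code, the reduction is a matter of a comparison per interval (illustrative Python):

def interval_to_point(interval):
    # the interval [xl, xr] becomes the point (xl, xr) above the main diagonal
    return interval

def diagonal_corner_query(points, x):
    # a vertical line at x stabs [xl, xr] iff xl <= x <= xr, i.e., iff the
    # point (xl, xr) lies in the two-sided range with apex (x, x)
    return [(xl, xr) for (xl, xr) in points if xl <= x <= xr]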
We now state a problem that occurs as a preprocessing step in a variety
of other problems.
Problem 6.13 (Segment Sorting). Given a set S of N non-intersecting
segments in the plane, compute the partial order given by the “above-below”
relation and extend this order to a total order on S.
Computing a total order on a set of non-intersecting segments in the
plane has important applications, e.g., for the Vertical Ray-Shooting prob-
lem [69, 613] (see Problem 6.17) or the Bichromatic Segment Intersection
problem [70]. The solution to the Segment Sorting problem makes use of
what is called an extended external segment tree. This data structure has
been proposed for solving the Endpoint Dominance problem which we dis-
cuss next.
(a) Endpoint dominance. (b) Trapezoidal decomposition.
Fig. 6.12. Problems involving multiple query points.
Problem 6.14 (Endpoint Dominance). Given a set S of N non-inter-
secting segments in the plane, find for each endpoint of a segment in S the
segment in S (if any) directly above this endpoint (see Figure 6.12(a)).
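As a semantic baseline, the brute-force solution below (illustrative Python; segments are pairs of endpoints with distinct x-coordinates) states precisely what has to be computed, albeit with Θ(N^2) work:

def y_at(seg, x):
    (x1, y1), (x2, y2) = seg
    return y1 + (y2 - y1) * (x - x1) / (x2 - x1)   # segments are non-vertical

def endpoint_dominance(S):
    # for each endpoint p, the segment of S directly above p (or None)
    result = {}
    for s in S:
        for p in s:
            best = None
            for t in S:
                if t is s:
                    continue
                lo, hi = sorted((t[0][0], t[1][0]))
                if lo <= p[0] <= hi:
                    y = y_at(t, p[0])
                    if y >= p[1] and (best is None or y < best[0]):
                        best = (y, t)
            result[p] = best[1] if best else None
    return result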
Even though it seems that the Endpoint Dominance problem could be
solved by repeatedly querying an external interval tree,8 the main motiva-
tion behind developing a different approach is that the Endpoint Dominance
problem is a batched static problem. For batched static problems, there is no
need to employ a data structure whose I/O-complexity per single operation is
worst-case optimal. Instead, a better overall I/O-complexity can be obtained
by building on certain aspects of lazy data processing as in the buffer tree
data structure (see Chapter 2).
8 In fact, a solution can be obtained using an augmented version of the external interval tree (see Problem 6.17).
Like the interval tree, the segment tree is a data structure for
storing a set of one-dimensional intervals [108, 614]. The main idea again is to
organize the x-coordinates in S as a binary search tree, but this time the x-
coordinates are stored exclusively in the leaves of the tree. For x_[1], . . . , x_[2N]
denoting the sorted sequence of x-coordinates in S and 1 ≤ i ≤ 2N − 1,
the i-th leaf (in left-to-right order) corresponds to the interval [x_[i], x_[i+1])
while the 2N-th leaf corresponds to the point x_[2N]. An internal node then
corresponds to the union of all intervals stored in the subtree below it. A
segment is stored at each node v, where it (or rather its projection onto the
x-axis) contains the interval corresponding to v, and this implies that each
segment can be stored in up to two nodes per level. This in turn implies
that an external segment tree occupies O((N/B) logM/B (N/B)) blocks, and
consequently each algorithm that relies on a set of segments being sorted
requires the same amount of (temporary) disk space for at least the duration
of the preprocessing step.
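The placement rule can be sketched as follows (illustrative Python; node.interval is the interval covered by the node, and the endpoints of the inserted x-projection are assumed to occur among the leaf boundaries, so the recursion never descends below a leaf):

def insert(node, x1, x2):
    # store the segment's x-projection [x1, x2] at every maximal node whose
    # interval it contains; at most two nodes per level receive the segment
    lo, hi = node.interval
    if x1 <= lo and hi <= x2:
        node.segments.append((x1, x2))
    elif x1 < hi and lo < x2:       # proper partial overlap: recurse
        insert(node.left, x1, x2)
        insert(node.right, x1, x2)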
An external segment tree as proposed by Arge, Vengroff, and Vitter [70]
can be seen as a hierarchical representation of the slabs visited during an
algorithm based upon distribution sweeping. Corresponding to this intuition,
the tree can be constructed efficiently top-down, distributing middle subseg-
ments to secondary multislab structures. This requires that one block for each
multislab can be held in main memory, and since the number of multislabs is
quadratic in the number of slabs, the number of slabs, that is, the fan-out of
the base tree (and thus of the corresponding distribution sweeping process),
is chosen as Θ(√(M/B)).9
9 Using a base tree with fan-out √(M/B) rather than M/B does not asymptotically change the complexity, as O((N/B) log_{√(M/B)}(N/B)) = O((N/B) log_{M/B}(N/B)); more precisely, the smaller fan-out results in a tree with twice as many levels.
To facilitate finding the segment immediately above another segment’s
endpoint, the segments in the multislab structures have to be sorted accord-
ing to the “above-below” relation. Given that the solution to the Endpoint
Dominance problem will be applied to solve the Segment Sorting problem
(Problem 6.13), this seems a prohibited operation. Exploiting the fact, how-
ever, that the middle subsegments have their endpoints on a set of Θ(√(M/B))
slab boundaries, Arge et al. [70] demonstrated how these segments can be
sorted in a linear number of I/Os using only a standard (one-dimensional)
sorting algorithm. Extending the external segment tree by keeping left and
right subsegments in sorted order as they are distributed to slabs on the
next level and using a simple counting argument, it can be shown that such
an extended external segment tree can be constructed top-down spending
O((N/B) logM/B (N/B)) I/Os.10
10 At present, it is unknown whether an extended external segment tree can be built efficiently in a multi-disk environment, that is, whether the complexity of building this structure is O((N/DB) logM/B (N/B)) I/Os for D ∈ O(1) [70].
The endpoint dominance queries are then filtered through the tree re-
membering for each query point the lowest dominating segment seen so far.
Filtering is done bottom-up reflecting the fact that the segment tree has
been built top-down. Arge et al. [70] built on the concept of fractional cas-
cading [182] and proposed to use segments sampled from the multislab lists
of a node v to each child (instead of the other way round) as bridges that
help finding the dominating segment in v once the dominating segment in the
nodes below v (if any) has been found. The number of sampled segments is
chosen such that the overall space requirement of the tree does not (asymp-
totically) increase and that, simultaneously for all multislabs of a node v,
all segments between two sampled segments can be held in main memory.
Then, Q queries can be filtered through the extended external segment tree
spending O(((N +Q)/B) logM/B (N/B)) I/Os, and after the filtering process,
all dominating segments are found.
A second approach is based upon the close relationship to the Trape-
zoidal Decomposition problem (Problem 6.15), namely that the solution for
the Endpoint Dominance problem can be derived from the trapezoidal de-
composition spending O(N/B) I/Os. As we will sketch, an algorithm derived
in the framework of Crauser et al. [228] computes the Trapezoidal Decom-
position of N non-intersecting segments spending an expected number of
O((N/B) logM/B (N/B)) I/Os, hence the Endpoint Dominance problem can
be solved spending asymptotically the same number of I/Os.
Arge et al. [70] demonstrate how the Segment Sorting problem (Prob-
lem 6.13) can be solved by reduction to the Endpoint Dominance problem
(Problem 6.14). Just as for computing the trapezoidal decomposition, two in-
stances of the Endpoint Dominance problem are solved, this time augmented
with horizontal segments at y = +∞ and y = −∞. Based upon the solu-
tion of these two instances, a directed graph G is created as follows: each
segment corresponds to a node, and if a segment u is dominated from above
(from below) by a segment v, the edge (u, v) (the edge (v, u)) is added to the
graph. The two additional segments ensure that each of the original segments
is dominated from above and from below, hence, the resulting graph is a pla-
nar (s, t)-graph. Computing the desired total order on S then corresponds
to topologically sorting G. As G is a planar (s, t)-graph of complexity Θ(N ),
this can be accomplished spending no more than O((N/B) logM/B (N/B))
I/Os [192].
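A compact in-memory rendering of this reduction is the following (illustrative Python; edges contains a pair (u, v) whenever segment v dominates an endpoint of segment u from above and a pair (v, u) whenever v dominates u from below; the I/O-efficient algorithm of [192] replaces this Kahn-style sort):

from collections import defaultdict, deque

def segment_sort(segments, edges):
    # topological sort of the dominance graph; the returned order extends
    # the partial "above-below" order to a total order on S
    succ, indeg = defaultdict(list), defaultdict(int)
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    queue = deque(s for s in segments if indeg[s] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return order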
Problem 6.15 (Trapezoidal Decomposition). Given a set S of N non-
intersecting segments in the plane, compute the planar partition induced by
extending a vertical ray in (+y)- and (−y)-direction from each endpoint p
of each segment until it hits the segment of S (if any) directly above or
below p, respectively (see Figure 6.12(b)).
While the Trapezoidal Decomposition problem is closely related to the
Endpoint Dominance problem (Problem 6.14) and to the Polygon Triangu-
lation problem (Problem 6.16), it is also of independent interest. In inter-
nal memory, computing the trapezoidal decomposition as a preprocessing step
helps in solving the Planar Point Location problem [457, 679] (Problem 6.18)
and performing map-overlay [304] (see Problem 6.22).
In the external memory setting, two algorithms are known for solving
the Trapezoidal Decomposition problem. The first approach, proposed by
Arge et al. [70], exploits the simple fact that combining the results of two in-
stances of the Endpoint Dominance problem (one with negated y-coordinates
of all objects) yields the desired decomposition. All vertical extensions can be
computed explicitly by linearly scanning the output of both Endpoint Domi-
nance instances. The resulting extensions are then sorted by the name of the
original segment they lie on (ties are broken by x-coordinates), and during
one scan of the sorted output, all trapezoids can be reported in explicit form.
The second approach can be obtained within the framework of randomized
incremental construction with gradations as proposed by Crauser et al. [228].
Even if the segments in S are not intersection-free but induce Z intersec-
tions, the Trapezoidal Decomposition problem can be solved spending an
expected optimal number of O((N/B) logM/B (N/B) + Z/B) I/Os. The ba-
sic idea behind this framework is to externalize the paradigm of randomized
incremental construction (considering elements from the problem instance
one after the other, but in random order). Externalization is facilitated using
gradations (see, e.g., [566]), a concept originating in the design of parallel
algorithms. A gradation is a geometrically increasing random sequence of
subsets ∅ = S_0 ⊆ · · · ⊆ S_r = S. The randomized incremental construction
with gradations refines the (intermediate) solution for S_i by simultaneously
adding all objects in S_{i+1} \ S_i (that is, in parallel respectively blockwise). This
framework is both general and powerful enough to yield algorithms with ex-
pected optimal complexity for a variety of geometric problems. As discussing
the sophisticated details and the analysis of the resulting algorithms would be
beyond the scope of this survey, we will only mention these results whenever
appropriate and instead refer the interested reader to the original article [228].
(a) Triangulation of a unimonotone polygon. (b) Trapezoidal decomposition.
Fig. 6.13. Polygon triangulation and its relation to trapezoid decomposition.
Problem 6.16 (Polygon Triangulation). Given a simple polygon P in
the plane with N edges, partition the interior of P into N − 2 faces bounded
by three segments each by adding N − 3 non-intersecting line segments con-
necting two vertices of P (see Figure 6.13(a)).
Fournier and Montuno [310] proved that in the internal memory setting
the Polygon Triangulation problem is (linear-time) equivalent to the Trape-
zoidal Decomposition problem (Problem 6.15) applied to the interior of the
polygon (see Figure 6.13(b)). Subsequently, all internal memory algorithms
built upon this fact, culminating in an optimal linear-time algorithm by
Chazelle [180]. The main idea of computing a triangulation from a trape-
zoidal decomposition is to subdivide the original polygon into a collection of
unimonotone polygons. A simple polygon with vertices v1 , . . . , vN is called
unimonotone if there are vertices vi and vi+1 such that the projections of
vi+1 , . . . , vi+N onto the line supporting the edge (vi , vi+1 ) (all indices are to
be read modulo N ) form a sorted sequence. A unimonotone polygon can then
be triangulated by repeatedly cutting off convex corners during a stack-driven
traversal of the polygon’s boundary (see Figure 6.13(a)).
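The stack-driven traversal can be sketched as follows (illustrative Python; chain lists the vertices in counterclockwise order beginning and ending at the endpoints of the edge (v_i, v_{i+1}), so that the projections are sorted; collinear degeneracies are ignored and the orientation test may need its sign flipped for the opposite orientation):

def cross(a, b, c):
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def triangulate_unimonotone(chain):
    triangles = []
    stack = [chain[0], chain[1]]
    for v in chain[2:-1]:
        # cut off convex corners as long as the top of the stack forms one
        while len(stack) >= 2 and cross(stack[-2], stack[-1], v) > 0:
            triangles.append((stack[-2], stack[-1], v))
            stack.pop()
        stack.append(v)
    last = chain[-1]
    while len(stack) >= 2:          # the last vertex sees all remaining vertices
        triangles.append((stack[-2], stack[-1], last))
        stack.pop()
    return triangles                # N - 2 triangles for an N-vertex polygon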
While the traversal of a polygon’s boundary can be done spending no more
than a linear number of I/Os, explicitly constructing the unimonotone poly-
gons is more involved. The key observation is that all necessary information
for subdividing a polygon into unimonotone polygons can be inferred locally,
i.e., by looking at isolated trapezoids. Each trapezoid is either a triangle or
it is determined by vertical lines originating from two polygon vertices (see
Figure 6.14). Fournier and Montuno [310] showed that by adding a diagonal
between every such pair of vertices that do not already form a polygon edge,
the polygon is partitioned into unimonotone polygons.
Fig. 6.14. Three classes of trapezoids.
Arge et al. [70] built upon this observation and proposed the following al-
gorithm for computing a triangulation of the given polygon. First, the trape-
zoidal decomposition is computed and all resulting trapezoids are scanned to
see whether they induce diagonals as described above. For each vertex deter-
mining a qualifying trapezoid, a pointer to the matching vertex is stored. In
the second phase, the sequence of vertices on the boundary is transformed
into a linked list representing the vertices of the unimonotone subpolygons
as they appear in clockwise order on the respective boundaries.
Applying a list ranking algorithm (see Chapter 3) to this linked list yields
the sequence of vertices for each unimonotone subpolygon in sorted order.
The I/O-complexity of list ranking is O((N/B) logM/B (N/B)) [52, 192]. As
mentioned above, each subpolygon can then be triangulated spending a linear
number of I/Os. Summing up, we obtain an O((N/B) logM/B (N/B)) algo-
rithm for triangulating a simple polygon. As the internal memory complexity
of this problem is Θ(N ), a natural question is whether there exists an external
algorithm with matching O(N/B) I/O-complexity. At present, however, it is
unknown whether either the Trapezoidal Decomposition problem or the Poly-
gon Triangulation problem can be solved spending o((N/B) logM/B (N/B))
I/Os.
(a) Vertical ray shooting from point p. (b) Point location query for point p.
Fig. 6.15. Problems involving a single query point.
Problem 6.17 (Vertical Ray-Shooting). Given a set S of N non-inter-
secting segments in the plane and a query point p in the plane, find the
segment in S (if any) first hit by a ray emanating from p in (+y)-direction
(see Figure 6.15(a)).
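As a baseline that makes the query semantics precise, a single query can always be answered by one scan over the segments, i.e., in O(N/B) I/Os when S is stored contiguously. A minimal sketch (names ours; vertical segments excluded for simplicity):

```python
def vertical_ray_shoot(segments, p):
    """Return the segment first hit by a ray from p in (+y)-direction.

    segments: non-intersecting, non-vertical segments, each given as
    ((x1, y1), (x2, y2)).  Returns (segment, y-coordinate of the hit),
    or None if the ray hits nothing.
    """
    px, py = p
    best = None
    for (x1, y1), (x2, y2) in segments:
        if min(x1, x2) <= px <= max(x1, x2):
            # y-coordinate of the segment at px, by linear interpolation
            y = y1 + (y2 - y1) * (px - x1) / (x2 - x1)
            if y >= py and (best is None or y < best[1]):
                best = (((x1, y1), (x2, y2)), y)
    return best

print(vertical_ray_shoot([((0, 2), (4, 2)), ((1, 5), (3, 5))], (2, 0)))
# (((0, 2), (4, 2)), 2.0)
```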
The first approach to external memory vertical ray-shooting has been
proposed by Goodrich et al. [345] for the special case of S forming a mono-
tone subdivision (a polygon is called monotone in direction θ if any line in
direction π/2 + θ intersects the polygon in a connected interval; a planar
subdivision Π is monotone if all faces of Π are monotone in the same
direction). Combining an on-line filtering technique with an ex-
ternal version of a fractional-cascaded data structure [182, 183], they ob-
tained a linear space external data structure that can be used to answer a
vertical ray-shooting query in O(logB N ) I/Os. Applying a batch filtering
technique, a batch of Q vertical ray-shooting queries can be answered in
O(((N + Q)/B) logM/B (N/B)) I/Os.
Arge et al. [70] extended this result to a set S forming a general pla-
nar subdivision. Their solution is based upon the observation that each of
the Q query points can be regarded as a (infinitesimally short) segment
and that solving the Endpoint Dominance problem (see Problem 6.14) for
the union of S and these Q segments yields the dominating segment for
each query point. Their solution, using an extended external segment tree,
requires O(((N + Q)/B) logM/B (N/B)) I/Os and O((N/B) logM/B (N/B))
space. Along similar lines, namely by reduction to the Trapezoidal Decompo-
sition problem (Problem 6.15), an algorithm can be derived in the framework
of Crauser et al. [228]. The resulting algorithm then answers a batch of Q
vertical ray-shooting queries in expected O(((N +Q)/B) logM/B (N/B)) I/Os
using linear space.
The Vertical Ray-Shooting problem can be stated in a dynamic version,
in which the set S additionally needs to be maintained under insertions and
deletions of segments. For algorithms building upon the assumption that S
forms a monotone subdivision Π, this implies that Π remains monotone after
each update.
The most successful internal memory approaches [95, 188] to the dynamic
version of the Vertical Ray-Shooting problem are based upon interval trees.
In contrast to the Segment Stabbing problem (Problem 6.12), however, one
cannot afford to report all segments above the query point p (there might
be Θ(N ) of them) just to find the one immediately above p. As a conse-
quence, the secondary data structures associated with the nodes of the base
interval tree have to reflect both the horizontal order of the endpoints within
the slab (if applicable) and the vertical ordering of the (left, right, and mid-
dle) segments. These requirements increase the complexity of dynamically
maintaining S under insertions and deletions.
For the left and right structures associated with each slab of a node in
the interval tree, Agarwal et al. [4] built upon ideas due to Cheng and
Janardan [188] and described a dynamic data structure for storing ν left and
right subsegments with O(logB ν) update time. Maintenance of the middle
segments is complicated by the fact that not all segments are comparable
according to the above-below relation (Problem 6.13), and that insertion of a
new segment might globally affect the total order induced by this (local) par-
tial order. Using level-balanced B-trees (see Chapter 2) and exploiting special
properties of monotone subdivisions, Agarwal et al. [4] obtained a dynamic
data structure for storing ν middle subsegments with O(log²B ν) update time.
The global data structure uses linear space and can be used to answer a verti-
cal ray-shooting query in a monotone subdivision spending O(log²B N) I/Os.
The amortized update complexity is O(log²B N).
This result was improved by Arge and Vahrenhold [69] who applied the
logarithmic method (see Chapter 2) and an external variant of dynamic frac-
tional cascading [182, 183] to obtain the same update and query complexity
for general subdivisions (the deletion bound can be improved to O(logB N)
I/Os amortized). The analysis is based upon the (realistic) assumption
B² < M. Under the weaker assumption 2B < M, the amortized insertion
bound becomes O(logB N · logM/B (N/B)) I/Os while all other bounds
remain the same.
A batched semidynamic version, in which either only deletions or only
insertions are allowed and all updates have to be known in advance, has
been proposed by Arge et al. [65]. Using an external decomposition approach
to the problem, O(Q) point location queries and O(N) updates can be
performed in O(((N + Q)/B) log²M/B ((N + Q)/B)) I/Os using O((N + Q)/B)
space.
Problem 6.18 (Planar Point Location). Given a planar partition Π with
N edges and a query point p in the plane, find the face of Π containing p
(see Figure 6.15(b)).
Usually, each edge in a planar partition stores the names of the two faces
of Π it separates. Then, algorithms for solving the Vertical Ray-Shooting
problem (Problem 6.17) can be used to answer point location queries with
constant additional work.
Most algorithms for vertical ray-shooting exploit hierarchical decomposi-
tions which can be generalized to a so-called trapezoidal search graph [680].
Using balanced hierarchical decompositions, searching can then be done ef-
ficiently in both the internal and external memory setting. As the query
points and thus the search paths to be followed are not known in advance,
external memory searching in such a graph will most likely result in unpre-
dictable access patterns and random I/O operations. The same is true for
using general-purpose tree-based spatial index structures.
It is well known that disk technologies and operating systems sup-
port sequential I/O operations more efficiently than random I/O opera-
tions [519, 739]. Additionally, for practical applications, it is often desirable
to trade asymptotically optimal performance for simpler structures if there is
hope for comparable or even faster performance in practice. Vahrenhold and
Hinrichs [740] extended the bucketing technique of Edahiro, Kokubo, and
Asano [261] to the external memory setting incorporating both single-shot
and batched queries into a single algorithm. The resulting algorithm relies on
nothing more than sorting and scanning, and as the worst-case I/O complexity
of O(N/B) I/Os for a single-shot query occurs only in pathological
situations, the algorithm is both easy to implement and fast in practice [740].
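To convey the flavor of the bucketing approach, the following toy sketch (our simplification to vertical slabs only, with names of our choosing) distributes the segments into equal-width slabs by scanning and answers a query by scanning the query point's slab. The actual algorithm of Vahrenhold and Hinrichs [740] is considerably more careful, but it likewise relies only on sorting and scanning.

```python
def build_slabs(segments, num_slabs):
    """Distribute non-vertical segments into vertical slabs of equal width.

    A segment is placed in every slab its x-extent intersects, so a query
    only needs to inspect the segments stored with its own slab.
    """
    xs = [x for seg in segments for (x, _) in seg]
    x_min, x_max = min(xs), max(xs)
    width = (x_max - x_min) / num_slabs or 1.0
    slabs = [[] for _ in range(num_slabs)]
    for seg in segments:
        (x1, _), (x2, _) = seg
        lo = int((min(x1, x2) - x_min) / width)
        hi = int((max(x1, x2) - x_min) / width)
        for i in range(max(lo, 0), min(hi, num_slabs - 1) + 1):
            slabs[i].append(seg)
    return x_min, width, slabs

def query_slab(x_min, width, slabs, p):
    """Vertical ray shooting restricted to the query point's slab."""
    i = min(max(int((p[0] - x_min) / width), 0), len(slabs) - 1)
    px, py = p
    best = None
    for (x1, y1), (x2, y2) in slabs[i]:
        if min(x1, x2) <= px <= max(x1, x2):
            y = y1 + (y2 - y1) * (px - x1) / (x2 - x1)
            if y >= py and (best is None or y < best[1]):
                best = (((x1, y1), (x2, y2)), y)
    return best
```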
(a) Bichromatic segment intersection. (b) General segment intersection.
Fig. 6.16. Segment intersection problems.
Problem 6.19 (Bichromatic Segment Intersection). Given a set S1 of
non-intersecting “blue” segments in the plane and a set S2 of non-intersecting
“red” segments in the plane with |S1 ∪ S2 | ∈ Θ(N ), compute all Z “red-blue”
pairs of intersecting segments in S1 × S2 (see Figure 6.16(a)).
To facilitate exposition of the algorithm for the Bichromatic Segment In-
tersection problem, we first describe an algorithm for solving the special case
of Orthogonal Segment Intersection, where we want to report all intersections
between a set S1 of horizontal segments and a set S2 of vertical segments.
To solve this problem, Goodrich et al. [345] described an optimal algorithm
with O((N/B) logM/B (N/B) + Z/B) I/O-complexity that is based upon the
distribution sweeping paradigm. For each slab, a so-called active list Ai is
maintained. If, during the top-down sweep, the upper (lower) endpoint of
a vertical segment is encountered, the segment is added to (removed from)
the active list of the slab it falls into. If a left endpoint of a segment s is
encountered, the intersections of s with the segments in the active lists of the
slabs spanned by s are reported. Intersections within slabs intersected but
not spanned by s are reported while sweeping at lower levels of recursion.
Using lazy deletions from the active lists and an amortization argument, the
method can be shown to require a linear number of I/Os per level of recur-
sion. As recursion stops as soon as the subproblem can be solved in main
memory, the overall I/O-complexity is O((N/B) logM/B (N/B) + Z/B) [345].
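A compact internal-memory analogue of this sweep is sketched below (names ours): a vertical segment enters the active structure at its upper endpoint and leaves at its lower endpoint, and each horizontal segment reports the active verticals inside its x-range. The distribution-sweeping version replaces the single search structure by one active list per slab and uses lazy deletions to keep the I/O bound.

```python
import bisect

def orthogonal_intersections(horizontals, verticals):
    """Report all horizontal-vertical crossings by a top-down sweep.

    horizontals: list of (xl, xr, y); verticals: list of (x, yl, yh).
    Returns pairs (index of horizontal, index of vertical).
    """
    INSERT, QUERY, REMOVE = 0, 1, 2   # tie-breaking order at equal y
    events = []
    for j, (x, yl, yh) in enumerate(verticals):
        events.append((yh, INSERT, j))
        events.append((yl, REMOVE, j))
    for i, (xl, xr, y) in enumerate(horizontals):
        events.append((y, QUERY, i))
    events.sort(key=lambda e: (-e[0], e[1]))   # decreasing y, inserts first

    active = []                                # sorted list of (x, index)
    result = []
    for y, kind, idx in events:
        if kind == INSERT:
            bisect.insort(active, (verticals[idx][0], idx))
        elif kind == REMOVE:
            active.pop(bisect.bisect_left(active, (verticals[idx][0], idx)))
        else:
            xl, xr, _ = horizontals[idx]
            k = bisect.bisect_left(active, (xl, -1))
            while k < len(active) and active[k][0] <= xr:
                result.append((idx, active[k][1]))
                k += 1
    return result

print(orthogonal_intersections([(0, 4, 1)], [(2, 0, 3), (5, 0, 3)]))
# [(0, 0)]
```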
This approach has been refined by Arge et al. [70] to obtain an opti-
mal O((N/B) logM/B (N/B) + Z/B) algorithm for the Bichromatic Segment
Intersection problem. In a preprocessing step, the red segments and the end-
points of the blue segments (regarded as infinitesimally short segments) are
merged into one set and sorted according to the “above-below” relation. The
same process is repeated for the set constructed from the blue segments and
the endpoints of the red segments. We now describe the work done on each
level of recursion during the distribution sweeping.
In the terminology of the description of the external interval tree (see
Problem 6.12), the algorithm first detects intersections between red middle
subsegments and blue left and right subsegments. The key to an efficient
solution is to explicitly construct the endpoints of the blue left and right
subsegments that lie on the slab boundaries and to merge them into the sorted
list of red middle subsegments and the (proper) endpoints of the blue left
and right subsegments. During a top-down sweep over the plane (in segment
order), blue left and right subsegments are then inserted into active lists
of their respective slab as soon as their topmost endpoint is encountered,
and for each red middle subsegment s encountered, the active lists of the
slabs spanned by s are scanned to produce red-blue pairs of intersecting
segments. As soon as a red middle subsegment does not intersect a blue left
or right subsegment, this blue segment cannot be intersected by any other red
segment, hence, it can be removed from the slab’s active list. An amortization
argument shows that all intersections can be reported in a linear number of
I/Os. An analogous scan is performed to report intersections between blue
middle subsegments and red left and right subsegments.
In a second phase, intersections between middle subsegments of different
colors are reported. For each multislab, a multislab list is created, and each
red middle subsegment is then distributed to the list of the maximal multislab
that it spans. An immediate consequence of the red segments being sorted
is that each multislab list is sorted by construction. Using a synchronized
traversal of the sorted list of blue middle subsegments and multislab lists
and repeating the process with the blue middle subsegments distributed
instead, all red-blue pairs of intersecting middle subsegments can be
reported spending a linear number of I/Os. Intersections between non-middle
subsegments of different colors are found by recursion within the slabs. As
in the orthogonal setting, a linear number of I/Os is spent on each level
of recursion, hence, the overall I/O-complexity is O((N/B) logM/B (N/B) +
Z/B).
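Distributing a middle subsegment to the list of the maximal multislab it spans amounts to a simple search over the (sorted) slab boundaries; a small sketch under this assumption:

```python
import bisect

def maximal_multislab(boundaries, xl, xr):
    """Return (i, j) such that [xl, xr] fully spans slabs i, ..., j - 1,
    where slab t lies between boundaries[t] and boundaries[t + 1];
    None if the segment spans no slab completely."""
    i = bisect.bisect_left(boundaries, xl)       # first boundary >= xl
    j = bisect.bisect_right(boundaries, xr) - 1  # last boundary <= xr
    return (i, j) if i < j else None

print(maximal_multislab([0, 1, 2, 3, 4], 0.5, 3.5))
# (1, 3), i.e., slabs 1 and 2 are fully spanned
```

Because the red segments are handled in sorted order, appending each one to its multislab list in this way leaves every list sorted by construction.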
Since computing the trapezoidal decomposition of a set of segments yields
the Z intersection points without additional work, an algorithm with ex-
pected optimal O((N/B) logM/B (N/B) + Z/B) I/O-complexity can be de-
rived in the framework of Crauser et al. [228].
Problem 6.20 (Segment Intersection). Given a set S of N segments
in the plane, compute all Z pairs of intersecting segments in S × S (see
Figure 6.16(b)).
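Whatever the I/O strategy, reporting ultimately rests on an exact pairwise test; for reference, the standard orientation-based predicate is sketched below (exact for integer coordinates).

```python
def segments_intersect(s, t):
    """Exact test whether the closed segments s and t share a point.

    Each segment is given as ((x1, y1), (x2, y2)).
    """
    def orient(a, b, c):   # sign of the cross product (b - a) x (c - a)
        d = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
        return (d > 0) - (d < 0)

    def on_segment(a, b, c):   # c collinear with a, b: is c between them?
        return (min(a[0], b[0]) <= c[0] <= max(a[0], b[0]) and
                min(a[1], b[1]) <= c[1] <= max(a[1], b[1]))

    p1, p2 = s
    p3, p4 = t
    d1, d2 = orient(p3, p4, p1), orient(p3, p4, p2)
    d3, d4 = orient(p1, p2, p3), orient(p1, p2, p4)
    if d1 * d2 < 0 and d3 * d4 < 0:          # proper crossing
        return True
    return ((d1 == 0 and on_segment(p3, p4, p1)) or
            (d2 == 0 and on_segment(p3, p4, p2)) or
            (d3 == 0 and on_segment(p1, p2, p3)) or
            (d4 == 0 and on_segment(p1, p2, p4)))

print(segments_intersect(((0, 0), (4, 4)), ((0, 4), (4, 0))))  # True
```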
Even though the general Segment Intersection problem appears consider-
ably more complicated than its bichromatic variant, its intrinsic complexity is
the same [88, 181]. An external algorithm with (suboptimal) I/O-complexity
of O(((N + Z)/B) logM/B (N/B)) has been proposed by Arge et al. [70]. The
main idea is to integrate all phases of the deterministic solution described
for the Bichromatic Segment Intersection problem (see Problem 6.19) into
one single phase. The distribution sweeping paradigm is not directly ap-
plicable because there is no total order on a set of intersecting segments.
Arge et al. [70] proposed to construct an extended external segment tree on
the segments and (during the construction of this data structure) to break the
segments stored in the same multislab lists into non-intersecting fragments.
The resulting segment tree can then be used to detect intersections between
segments stored in different multislab lists. For details and the analysis of
this second phase, we refer the reader to the full version of the paper [70].
Since computing the trapezoidal decomposition of a set of segments yields
the Z intersection points without additional work, an algorithm with ex-
pected optimal O((N/B) logM/B (N/B) + Z/B) I/O-complexity can be de-
rived in the framework of Crauser et al. [228]. It remains an open problem,
though, to find a deterministic optimal solution for the Segment Intersection
problem.
Jagadish [426] developed a completely different approach to finding all
line segments that intersect a given line segment. Applying this algorithm
to all segments and removing duplicates, it can also be used to solve Prob-
lem 6.20. This algorithm, which has experimentally been shown to perform well
for real-world data sets [426], partitions the d-dimensional data space into
d partitions (one for each axis) and stores a small amount of data for each
line segment in the partition with whose axis this line segment defines the
smallest angle. The data stored is determined by using a modified version of
Hough transform [408]. For simplicity, the planar case is considered here first,
before we show how to generalize it to higher dimensions. In the plane, each
line segment determines a line given by either y = m · x + b or x = m · y + b,
and at least one of these lines has a slope in [−1, 1]. This equation is taken
to map m and b to a point in (2-dimensional) transform space by a duality
transform. An intersection test for a given line segment works as follows. The
two endpoints are transformed into lines first. Assuming for simplicity that
these lines intersect (the approach also works for parallel lines), we know that
these two lines divide the transform space into four regions. Transforming a
third point of the line segment, the two regions between the transformed lines
can be determined easily. The points contained in these regions (or, rather,
the segments supported by their dual lines) are candidates for intersecting
line segments. Whether they really intersect can be tested by comparing the
projections on the partition axis of both segments which have been stored
along with each point in transform space. For d-dimensional data spaces, only
minor changes are needed. After determining the partition axis, the projections of
each line segment on the d − 1 planes involving this axis are treated as above
resulting in d − 1 lines and a point in (2(d − 1))-dimensional transform space.
In addition, the interval of the projection on the partition axis is stored. Note
that this technique needs 2dN space to store the line segments. Unfortunately,
no asymptotic bounds for query time are given, but experiments show that
this approach is more efficient than using spatial index structures or trans-
forming the two d-dimensional endpoints into one point in 2d-dimensional
data space, which are both very common approaches. Some other problems
including finding all line segments passing through or lying in the vicinity of
a specified point can be solved by this technique [426].
The Segment Intersection problem has a natural extension: Given a set
of polygonal objects, report all intersecting pairs of objects. While at first it
seems that this extension is quite straightforward, we will demonstrate in the
next section that only special cases can be solved efficiently.
6.5 Problems Involving Sets of Polygonal Objects
In spatial databases that store sets of polygonal objects, combining two planar
partitions (maps) m1 and m2 by map overlay or spatial overlay join is an
important operation. The spatial overlay join of m1 and m2 produces a set
of pairs of polygonal objects (o1 , o2 ) where o1 ∈ m1 , o2 ∈ m2 , and o1 and o2
intersect. In contrast, the map overlay produces a set of polygonal objects
consisting of the following objects:
– All objects of m1 intersecting no object of m2
– All objects of m2 intersecting no object of m1
– All polygonal objects produced by two intersecting objects of m1 and m2
Usually spatial join operations are performed in two steps [594]:
– In the filter step, a conservative approximation of each spatial object is
used to eliminate objects that cannot be part of the result.
– In the refinement step, each pair of objects passing the filter step is exam-
ined according to the spatial join condition.
In the context of spatial join, the most common approximation for a spa-
tial object is by means of its minimum bounding box (see Section 6.2.3 and
Figure 6.17(a)). Performing the filter step then can be restated as finding
all pairs of intersecting rectangles between two sets of rectangles (see Fig-
ure 6.17(b)). The minimum bounding box is chosen from several possible
approximations as it realizes a good trade-off between approximation qual-
ity and storage requirements. In addition, most spatial index structures are
based on minimum bounding boxes. Nevertheless, some approaches use addi-
tional approximations to further reduce the number of pairs passing the filter
step [150].
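As a minimal illustration of the filter step (the object representation and all names are ours; real systems replace the quadratic loop with the sweep- or index-based methods discussed below):

```python
def mbr(points):
    """Minimum bounding box (xl, yl, xr, yr) of a polygon's vertices."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))

def boxes_intersect(a, b):
    """True iff the two boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def filter_step(map1, map2):
    """Conservative filter: candidate pairs whose bounding boxes intersect.

    Each map is a list of polygons (vertex lists).  Pairs surviving the
    filter are handed to the refinement step for an exact test.
    """
    boxes1 = [mbr(o) for o in map1]
    boxes2 = [mbr(o) for o in map2]
    return [(i, j)
            for i, b1 in enumerate(boxes1)
            for j, b2 in enumerate(boxes2)
            if boxes_intersect(b1, b2)]
```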
Problem 6.21 (Rectangle Intersection). Given a set S of N axis-aligned
rectangles in the plane, compute all Z pairs of intersecting rectangles in S ×S.
In our definition, this includes pairs of rectangles where one rectangle is fully
contained in the other rectangle.
(a) Bounding boxes of road features (Block Island, RI). (b) Bounding boxes of
roads and hydrography features (Block Island).
Fig. 6.17. Using rectangular bounding boxes for a spatial join operation.
An efficient approach based upon the distribution sweeping paradigm has
been proposed by Goodrich et al. [345] and later restated in the context
of bichromatic decomposable problems by Arge et al. [64, 65]. The main
observation is that, for any pair of intersecting rectangles, there exists a
horizontal line that passes through both rectangles, and thus their projections
onto this line consist of overlapping intervals. In Figure 6.18, three pairs of
intersecting rectangles and their projections onto the sweep-line are shown.
Fig. 6.18. Intersecting rectangles correspond to overlapping intervals.
The proposed algorithm for solving the Rectangle Intersection prob-
lem searches for intersections between active rectangles by exploiting this
rectangle–interval correspondence: As the sweep-line advances over the data,
the algorithm maintains the projections of all active rectangles onto the
sweep-line and checks which of these intervals intersect, thus reducing the
static two-dimensional rectangle intersection problem to the dynamic one-
dimensional interval intersection problem. During the top-down sweep,
Θ(M/B) multislab lists are maintained, and the (projected) intervals are
first used to query the multislab lists for overlap with other intervals be-
fore the middle subsegments are inserted into the multislab lists themselves
(left and right subsegments are treated recursively; in the bichromatic setting,
two sets of multislab lists are used, one for each color). A middle subsegment
is removed from the multislab lists when the sweep-line passes the lower
boundary of the original rectangle. Making sure that these deletions are per-
formed in a blocked manner, one can show that the overall I/O-complexity
is O((N/B) logM/B (N/B) + Z/B). In the case of rectangles, the reduction
to finding intersection of edges yields an efficient algorithm as the number
of intersecting pairs of objects is asymptotically the same as the number of
intersecting pairs of edges—in the more general case of polygons, this is not
the case (see Problem 6.22).
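A small internal-memory sketch of this reduction (names ours): sweeping by decreasing y, a rectangle's x-interval is tested against the currently active intervals at the moment the rectangle appears, so every intersecting pair is reported exactly once. The external algorithm keeps the active intervals in multislab lists instead of one flat set.

```python
def rectangle_intersections(rects):
    """Report all intersecting pairs among axis-aligned rectangles.

    rects: list of (xl, xr, yl, yh).  A pair is reported exactly once, at
    the moment the rectangle with the lower top edge becomes active.
    """
    INSERT, REMOVE = 0, 1
    events = []
    for i, (xl, xr, yl, yh) in enumerate(rects):
        events.append((yh, INSERT, i))
        events.append((yl, REMOVE, i))
    events.sort(key=lambda e: (-e[0], e[1]))   # decreasing y, inserts first

    active, result = set(), []
    for _, kind, i in events:
        if kind == INSERT:
            xl, xr = rects[i][0], rects[i][1]
            for j in active:                    # 1-d interval overlap test
                if xl <= rects[j][1] and rects[j][0] <= xr:
                    result.append((j, i))
            active.add(i)
        else:
            active.discard(i)
    return result

print(rectangle_intersections([(0, 2, 0, 2), (1, 3, 1, 3), (5, 6, 0, 1)]))
# [(1, 0)]
```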
In the database community, this problem is considered almost exclusively
in the bichromatic case of the filter step of spatial join operations, and several
heuristics implementing the filter step and performing well for real-world
data sets have been proposed during the last decade. Most of the proposed
algorithms [101, 150, 151, 362, 401, 414, 603, 704] need index structures for
both data sets, while others only require one data set to be indexed [63,
365, 511, 527], but also spatial hash joins [512] and other non-index-based
algorithms [64, 477, 606] have been presented. Moreover, other conservative
approximation techniques besides minimum bounding boxes—mainly in the
planar case—like the convex hull, the minimum bounding m-corner (especially
m ∈ {4, 5}), the smallest enclosing circle, or the smallest enclosing ellipse
[149, 150], as well as four-color raster signatures [782], have been considered.
progressive approximations like maximum enclosed circle or rectangle leads
to fast identification of object pairs which can be reported without testing
the exact geometry [149, 150]. Rotem [639] proposed to transform the idea
of join indices [741] to n-dimensional data space using the grid file [583].
The central idea behind all approaches summarized above is to repeat-
edly reduce the working set by pruning or partitioning until it fits into main
memory where an internal memory algorithm can be used. Most index-based
algorithms exploit the hierarchical representation implicitly given by the in-
dex structures to prune parts of the data sets that cannot contribute to the
output of the join operator. In contrast, algorithms for non-indexed spatial
join try to reduce the working set by either imposing an (artificial) order
and then performing some kind of merging according to this order or by
hashing the data to smaller partitions that can be treated separately. The
overall performance of algorithms for the filter step, however, often depends
on subtle design choices and characteristics of the data set [63], and therefore
discussing these approaches in sufficient detail would be beyond the scope of
this survey.
The refinement step of the spatial join cannot rely on approximations of
the polygonal objects but has to perform computations on the exact repre-
sentations of the objects that passed the filter step. In this step, the problem
is to determine all pairs of polygonal objects that fulfill the join predicate.
(a) Two simple polygons may have Θ(N²) intersecting pairs of edges. (b) Two
convex polygons may have Θ(N) intersecting pairs of edges.
Fig. 6.19. Output-sensitivity for intersecting polygons.
Problem 6.22 (Polygon Intersection). Given a set S of polygonal ob-
jects in the plane consisting of N edges in total, compute all Z pairs of
intersecting polygons in S × S. By definition this includes pairs of polygons
where one polygon is fully contained in the other polygon.
The main problem in developing efficient algorithms for the Polygon In-
tersection problem is the notion of output sensitivity. The problem, as stated
above, requires the output to depend on the number of intersecting pairs
of polygons and not on the number of intersecting pairs of polygon edges.
The problem could be easily solved by employing algorithms for the Segment
Intersection problem (Problem 6.20); however, the number of intersecting
pairs of edges can be asymptotically much larger than the number of in-
tersecting pairs of polygons. For simple polygons, each pair of intersecting
polygons can even give rise to a quadratic number of intersecting pairs of
edges (see Figure 6.19(a)). Even for convex polygons, any one pair of inter-
secting polygons can give rise to a linear number of intersecting pairs of edges
(see Figure 6.19(b)), but exploiting the convexity of the polygons, efficient
output-sensitive algorithms have been developed in the internal memory set-
ting [9, 364].
If the Polygon Intersection problem is considered in the context of the
bichromatic map overlay join, the output is no longer the set of intersecting
pairs of polygons, but it additionally includes the planar partition induced by
the overlay. This in turn removes the output-sensitivity issue, since all Z
pairs of intersecting segments have to be computed anyway. In the internal
memory setting, an optimal algorithm
for computing the map overlay join of two simply connected planar partitions
in O(N + Z) time and space has been proposed by Finke and Hinrichs [304].
This algorithm heavily relies on a trapezoidal decomposition of the partitions
and on the ability to efficiently traverse a connected planar subdivision, so
unless both problems can be solved optimally in the external memory setting,
there is little hope for an optimal external memory variant.
Some effort has also been made to combine spatial index structures and
internal memory algorithms ([174, 584]) for finding line segment intersec-
tions [100, 480], but these results rely on practical considerations about the
input data. Another approach, which is also claimed to be efficient for real-
world data sets [149], generates variants of R-trees, namely TR*-trees [671],
for both data sets and uses them to compute the result afterwards.
6.6 Conclusions
In this survey, we have discussed algorithms and data structures that can be
used for solving large-scale geometric problems. While a lot of research has
been done both in the context of spatial databases and in algorithmics, one
of the most challenging problems is to combine the best of these two worlds,
that is, algorithmic design techniques and insights gained from experiments on
real-world instances. The field of external memory experimental algorithmics
is still wide open.
Several important issues in large-scale Geographic Information Systems
have not been addressed in the context of external memory algorithms, in-
cluding how to externalize algorithms on triangulated irregular networks or
how to (I/O-efficiently) perform map-overlay on large digital maps. We con-
clude this chapter by stating two prominent open problems for which optimal
algorithms are known only in the internal memory setting:
– Is it possible to triangulate a simple polygon given its vertices in coun-
terclockwise order along its boundary spending only a linear number of
I/Os?
– Is it possible to compute all Z pairs of intersecting line segments in a set of
N line segments in the plane using a deterministic algorithm that spends
only O((N/B) logM/B (N/B) + Z/B) I/Os?
7. Full-Text Indexes in External Memory
Juha Kärkkäinen∗ and S. Srinivasa Rao
7.1 Introduction
A full-text index is a data structure storing a text (a string or a set of strings)
and supporting string matching queries: Given a pattern string P , find all
occurrences of P in the text. The best-known full-text index is the suffix
tree [761], but numerous others have been developed. Due to their fast con-
struction and the wealth of combinatorial information they reveal, full-text
indexes (and suffix trees in particular) also have many uses beyond basic
string matching. For example, the number of distinct substrings of a string
or the longest common substrings of two strings can be computed in lin-
ear time [231]. Gusfield [366] describes several applications in computational
biology, and many others are listed in [359].
Most of the work on full-text indexes has been done on the RAM model,
i.e., assuming that the text and the index fit into the internal memory. How-
ever, the size of digital libraries, biosequence databases and other textual
information collections often exceeds the size of the main memory on most
computers. For example, the GenBank [107] database contains more than
20 GB of DNA sequences in its August 2002 release. Furthermore, the size
of a full-text index is usually 4–20 times larger than the size of the text it-
self [487]. Finally, if an index is needed only occasionally over a long period
of time, one has to keep it either in internal memory reducing the memory
available to other tasks or on disk requiring a costly loading into memory
every time it is needed.
In their standard form, full-text indexes have poor memory locality. This
has led to several recent results on adapting full-text indexes to external
memory. In this chapter, we review the recent work focusing on two issues,
full-text indexes supporting I/O-efficient string matching queries (and up-
dates), and external memory algorithms for constructing full-text indexes
(and for sorting strings, a closely related task).
We do not treat other string techniques in detail here. Most string
matching algorithms that do not use an index work by scanning the text
more or less sequentially (see, e.g., [231, 366]), and are relatively trivial to
adapt to an externally stored text. Worth mentioning, however, are algo-
rithms that may generate very large automata in pattern preprocessing, such
as [486, 533, 573, 735, 770], but we are not aware of external memory versions
of these algorithms.
∗ Partially supported by the Future and Emerging Technologies programme of the
EU under contract number IST-1999-14186 (ALCOM-FT).
In information retrieval [85, 314], a common alternative to full-text in-
dexes is the inverted file [460], which takes advantage of the natural division
of linguistic texts into a limited number of distinct words. An inverted file
stores each distinct word together with a list of pointers to the occurrences
of the word in the text. The main advantage of inverted files is their space
requirement (about half of the size of the text [85]), but they cannot be used
with unstructured texts such as biosequences. Also, the space requirement of
the data structures described here can be significantly reduced when the text
is seen as a sequence of atomic words (see Section 7.3.2).
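A minimal inverted file over a toy word-based text (the representation is ours):

```python
from collections import defaultdict

def build_inverted_file(documents):
    """Map each distinct word to the list of its occurrences.

    documents: list of strings; an occurrence is (document id, position).
    """
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        for pos, word in enumerate(text.lower().split()):
            index[word].append((doc_id, pos))
    return index

index = build_inverted_file(["the quick brown fox", "the lazy dog"])
print(index["the"])   # [(0, 0), (1, 0)]
```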
Finally, we mention another related string technique, compression. Two
recent developments are compressed indexes [299, 300, 361, 448, 648, 649] and
sequential string matching in compressed text without decompression [40,
289, 447, 575, 576]. Besides trying to fit the text or index into main memory,
these techniques can be useful for reducing the time for moving data from
disk to memory.
7.2 Preliminaries
We begin with a formal description of the problems and the model of com-
putation.
The Problems Let us define some terminology and notation. An alphabet
Σ is a finite ordered set of characters. A string S is an array of characters,
S[1, n] = S[1]S[2] . . . S[n]. For 1 ≤ i ≤ j ≤ n, S[i, j] = S[i] . . . S[j] is a
substring of S, S[1, j] is a prefix of S, and S[i, n] is a suffix of S. The set of
all strings over alphabet Σ is denoted by Σ ∗ .
The main problem considered here is the following.
Problem 7.1 (Indexed String Matching). Let the text T be a set of K
strings in Σ ∗ with a total length N . A string matching query on the text
is: Given a pattern P ∈ Σ ∗ , find all occurrences of P as a substring of the
strings in T . The static problem is to store the text in a data structure, called
a full-text index, that supports string matching queries. The dynamic version
of the problem additionally requires support for insertion and deletion of
strings into/from T .
All the full-text indexes described here have a linear space complexity.
Therefore, the focus will be on the time complexity of queries and updates
(Section 7.4), and of construction (Section 7.5).
Additionally, the string sorting problem will be considered in Section 7.5.5.
Problem 7.2 (String Sorting). Given a set S of K strings in Σ ∗ with a
total length N , sort them into the lexicographic order.
The Model Our computational model is the standard external memory
model introduced in [17, 755] and described in Chapter 1 of this volume. In
particular, we use the following main parameters:
N = number of characters in the text or in the strings to be sorted
M = number of characters that fit into the internal memory
B = number of characters that fit into a disk block
and the following shorthand notations:
scan(N) = Θ(N/B)
sort(N) = Θ((N/B) logM/B (N/B))
search(N) = Θ(logB N)
The following parameters are additionally used:
K = number of strings in the text or in the set to be sorted
Z = size of the answer to a query (the number of occurrences)
|Σ| = size of the alphabet
|P | = number of characters in a pattern P
|S| = number of characters in an inserted/deleted string S
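To get a feeling for the relative magnitudes, the following computes the leading terms of the three bounds (constants omitted) for a hypothetical configuration of a 20 GB text, 1 GB of internal memory, and 8 KB blocks:

```python
from math import log

def io_bounds(N, M, B):
    """Leading terms of scan(N), sort(N) and search(N), constants omitted."""
    scan = N / B
    sort = (N / B) * log(N / B, M / B)
    search = log(N, B)
    return scan, sort, search

# N = 20 GB of characters, M = 1 GB, B = 8 KB (one character per byte).
scan, sort, search = io_bounds(20 * 2**30, 2**30, 2**13)
print(f"scan ~ {scan:.3g}, sort ~ {sort:.3g}, search ~ {search:.3g}")
# scan ~ 2.62e+06, sort ~ 3.29e+06, search ~ 2.64
```

For such realistic parameters, sort(N) exceeds scan(N) only by a small factor, while search(N) is a small constant.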
For simplicity, we mostly ignore the space complexity, the CPU com-
plexity, and the parallel (multiple disks) I/O complexity of the algorithms.
However, significant deviations from optimality are noted.
With respect to string representation, we mostly assume the integer al-
phabet model, where characters are integers in the range {1, . . . , N }. Each
character occupies a single machine word, and all usual integer operations on
characters can be performed in constant time. For internal memory compu-
tation, we sometimes assume the constant alphabet model, which differs from
the integer alphabet model in that dictionary operations on sets of characters
can be performed in constant time and linear space.1 Additionally, the packed
string model is discussed in Section 7.5.6.
7.3 Basic Techniques
In this section, we introduce some basic techniques. We start with the (for our
purposes) most important internal memory data structures and algorithms.
Then, we describe two external memory techniques that are used more than
once later.
1 With techniques such as hashing, this is nearly true even for the integer alphabet
model. However, integer dictionaries are a complex issue and outside the scope
of this article.
Fig. 7.1. Trie and compact trie for the set {potato, pottery, tattoo, tempo}
7.3.1 Internal Memory Techniques
Most full-text indexes are variations of three data structures, suffix ar-
rays [340, 528], suffix trees [761] and DAWGs (Directed Acyclic Word
Graphs) [134, 230]. In this section, we describe suffix arrays and suffix trees,
which form the basis for the external memory data structures described here.
We are not aware of any adaptation of DAWG for external memory.
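As a warm-up, the trie and compact trie of Fig. 7.1 can be built in a few lines (the dictionary-based representation and the end marker '$' are our choices):

```python
def build_trie(words):
    """Plain trie as nested dictionaries; '$' marks the end of a word."""
    root = {}
    for w in words:
        node = root
        for ch in w + "$":
            node = node.setdefault(ch, {})
    return root

def compact(trie):
    """Compact trie: paths of unary nodes are merged into single edges."""
    out = {}
    for ch, sub in trie.items():
        label, node = ch, sub
        while len(node) == 1:            # merge along unary chains
            (c, nxt), = node.items()
            label += c
            node = nxt
        out[label] = compact(node)
    return out

trie = build_trie(["potato", "pottery", "tattoo", "tempo"])
print(compact(trie))
# {'pot': {'ato$': {}, 'tery$': {}}, 't': {'attoo$': {}, 'empo$': {}}}
```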
Let us start with an observation that underlies almost all full-text indexes.
If an occurrence of a pattern P starts at position i in a string S ∈ T , then P
is a prefix of the suffix S[i, |S|]. Therefore, we can find all occurrences of P
by performing a prefix search query on the set of all suffixes of the text: A
prefix search query asks for all the strings i