0% found this document useful (0 votes)

164 views714 pages

Querycompiler PDF

This document provides an outline for a book on building query compilers. It discusses database management system architecture, interpretation versus compilation of queries, requirements for a query compiler, and optimization techniques like join ordering and access path selection. The book is expected to take 5 years to complete and will cover topics like logical and physical query optimization, deterministic and probabilistic algorithms for join ordering, and foundations of logic and functional dependencies in databases.

Uploaded by

Shashank Shekhar

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

164 views714 pages

Querycompiler PDF

Uploaded by

Shashank Shekhar

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Building Query Compilers

(Under Construction)
[expected time to completion: 5 years]

Guido Moerkotte

March 5, 2019
Contents

I Basics 3

1 Introduction 5
1.1 General Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 DBMS Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Interpretation versus Compilation . . . . . . . . . . . . . . . . . . 6
1.4 Requirements for a Query Compiler . . . . . . . . . . . . . . . . 9
1.5 Search Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.6 Generation versus Transformation . . . . . . . . . . . . . . . . . 12
1.7 Focus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.8 Organization of the Book . . . . . . . . . . . . . . . . . . . . . . 13

2 Textbook Query Optimization 15

2.1 Example Query and Outline . . . . . . . . . . . . . . . . . . . . . 15
2.2 Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Canonical Translation . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4 Logical Query Optimization . . . . . . . . . . . . . . . . . . . . . 20
2.5 Physical Query Optimization . . . . . . . . . . . . . . . . . . . . 24
2.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3 Join Ordering 31
3.1 Queries Considered . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1.1 Query Graph . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.2 Join Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1.3 Simple Cost Functions . . . . . . . . . . . . . . . . . . . . 34
3.1.4 Classification of Join Ordering Problems . . . . . . . . . . 40
3.1.5 Search Space Sizes . . . . . . . . . . . . . . . . . . . . . . 41
3.1.6 Problem Complexity . . . . . . . . . . . . . . . . . . . . . 45
3.2 Deterministic Algorithms . . . . . . . . . . . . . . . . . . . . . . 47
3.2.1 Heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2.2 Determining the Optimal Join Order in Polynomial Time 49
3.2.3 The Maximum-Value-Precedence Algorithm . . . . . . . . 56
3.2.4 Dynamic Programming . . . . . . . . . . . . . . . . . . . 61
3.2.5 Memoization . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.2.6 Join Ordering by Generating Permutations . . . . . . . . 79
3.2.7 A Dynamic Programming based Heuristics for Chain Queries 81
3.2.8 Transformation-Based Approaches . . . . . . . . . . . . . 94

i
ii CONTENTS

3.3 Probabilistic Algorithms . . . . . . . . . . . . . . . . . . . . . . . 101

3.3.1 Generating Random Left-Deep Join Trees with Cross Prod-
ucts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
3.3.2 Generating Random Join Trees with Cross Products . . . 103
3.3.3 Generating Random Join Trees without Cross Products . 107
3.3.4 Quick Pick . . . . . . . . . . . . . . . . . . . . . . . . . . 116
3.3.5 Iterative Improvement . . . . . . . . . . . . . . . . . . . . 116
3.3.6 Simulated Annealing . . . . . . . . . . . . . . . . . . . . . 117
3.3.7 Tabu Search . . . . . . . . . . . . . . . . . . . . . . . . . 118
3.3.8 Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . 119
3.4 Hybrid Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 122
3.4.1 Two Phase Optimization . . . . . . . . . . . . . . . . . . 122
3.4.2 AB-Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 122
3.4.3 Toured Simulated Annealing . . . . . . . . . . . . . . . . 122
3.4.4 GOO-II . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
3.4.5 Iterative Dynamic Programming . . . . . . . . . . . . . . 123
3.5 Ordering Order-Preserving Joins . . . . . . . . . . . . . . . . . . 123
3.6 Characterizing Search Spaces . . . . . . . . . . . . . . . . . . . . 131
3.6.1 Complexity Thresholds . . . . . . . . . . . . . . . . . . . 131
3.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
3.8 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

4 Database Items, Building Blocks, and Access Paths 137

4.1 Disk Drive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
4.2 Database Buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
4.3 Physical Database Organization . . . . . . . . . . . . . . . . . . . 147
4.4 Slotted Page and Tuple Identifier (TID) . . . . . . . . . . . . . . 150
4.5 Physical Record Layouts . . . . . . . . . . . . . . . . . . . . . . . 151
4.6 Physical Algebra (Iterator Concept) . . . . . . . . . . . . . . . . 152
4.7 Simple Scan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
4.8 Scan and Attribute Access . . . . . . . . . . . . . . . . . . . . . . 153
4.9 Temporal Relations . . . . . . . . . . . . . . . . . . . . . . . . . . 155
4.10 Table Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
4.11 Indexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
4.12 Single Index Access Path . . . . . . . . . . . . . . . . . . . . . . 158
4.12.1 Simple Key, No Data Attributes . . . . . . . . . . . . . . 158
4.12.2 Complex Keys and Data Attributes . . . . . . . . . . . . 163
4.13 Multi Index Access Path . . . . . . . . . . . . . . . . . . . . . . . 165
4.14 Indexes and Joins . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
4.15 Remarks on Access Path Generation . . . . . . . . . . . . . . . . 172
4.16 Counting the Number of Accesses . . . . . . . . . . . . . . . . . . 173
4.16.1 Counting the Number of Direct Accesses . . . . . . . . . . 173
4.16.2 Counting the Number of Sequential Accesses . . . . . . . 182
4.16.3 Pointers into the Literature . . . . . . . . . . . . . . . . . 187
4.17 Disk Drive Costs for N Uniform Accesses . . . . . . . . . . . . . 188
4.17.1 Number of Qualifying Cylinders, Tracks, and Sectors . . . 188
4.17.2 Command Costs . . . . . . . . . . . . . . . . . . . . . . . 189
CONTENTS iii

4.17.3 Seek Costs . . . . . . . . . . . . . . . . . . . . . . . . . . 189

4.17.4 Settle Costs . . . . . . . . . . . . . . . . . . . . . . . . . . 191
4.17.5 Rotational Delay Costs . . . . . . . . . . . . . . . . . . . 191
4.17.6 Head Switch Costs . . . . . . . . . . . . . . . . . . . . . . 193
4.17.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
4.18 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . 194
4.19 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194

II Foundations 197

5 Logic, Null, and Boolean Expressions 199

5.1 Two-valued logic . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
5.2 NULL values and three valued logic . . . . . . . . . . . . . . . . 199
5.3 Simplifying Boolean Expressions . . . . . . . . . . . . . . . . . . 200
5.4 Optimizing Boolean Expressions . . . . . . . . . . . . . . . . . . 200
5.5 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200

6 Functional Dependencies 203

6.1 Functional Dependencies . . . . . . . . . . . . . . . . . . . . . . . 203
6.2 Functional Dependencies in the presence of NULL values . . . . . 204
6.3 Deriving Functional Dependencies over algebraic operators . . . . 204
6.4 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204

7 An Algebra for Sets, Bags, and Sequences 205

7.1 Sets, Bags, and Sequences . . . . . . . . . . . . . . . . . . . . . . 205
7.1.1 Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
7.1.2 Duplicate Data: Bags . . . . . . . . . . . . . . . . . . . . 207
7.1.3 Explicit Duplicate Control . . . . . . . . . . . . . . . . . . 210
7.1.4 Ordered Data: Sequences . . . . . . . . . . . . . . . . . . 211
7.2 Aggregation Functions . . . . . . . . . . . . . . . . . . . . . . . . 212
7.3 Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
7.3.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . 217
7.3.2 Signatures . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
7.3.3 Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
7.3.4 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
7.3.5 Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
7.3.6 Unary Grouping . . . . . . . . . . . . . . . . . . . . . . . 223
7.3.7 Unnest Operators . . . . . . . . . . . . . . . . . . . . . . 224
7.3.8 Flatten Operator . . . . . . . . . . . . . . . . . . . . . . . 225
7.3.9 Join Operators . . . . . . . . . . . . . . . . . . . . . . . . 225
7.3.10 Groupjoin . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
7.3.11 Min/Max Operators . . . . . . . . . . . . . . . . . . . . . 227
7.3.12 Other Dependent Operators . . . . . . . . . . . . . . . . . 228
7.4 Linearity of Algebraic Operators . . . . . . . . . . . . . . . . . . 229
7.4.1 Linearity of Algebraic Operators . . . . . . . . . . . . . . 229
7.4.2 Exploiting Linearity . . . . . . . . . . . . . . . . . . . . . 234
iv CONTENTS

7.5 Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235

7.5.1 Three Different Representations . . . . . . . . . . . . . . . 235
7.5.2 Conversion between Representations . . . . . . . . . . . . 237
7.5.3 Conversion between Bulk Types . . . . . . . . . . . . . . 237
7.5.4 Adjusting the Algebra . . . . . . . . . . . . . . . . . . . . 238
7.5.5 Partial Preaggregation . . . . . . . . . . . . . . . . . . . . 239
7.6 A Note on Equivalences . . . . . . . . . . . . . . . . . . . . . . . 239
7.7 Simple Reorderability . . . . . . . . . . . . . . . . . . . . . . . . 240
7.7.1 Unary Operators . . . . . . . . . . . . . . . . . . . . . . . 240
7.7.2 Push-Down/Pull-Up of Unary into/from Binary Operators 242
7.7.3 Binary Operators . . . . . . . . . . . . . . . . . . . . . . . 244
7.8 Predicate Detachment and Attachment . . . . . . . . . . . . . . . 249
7.9 Basic Equivalences for D-Join . . . . . . . . . . . . . . . . . . . . 251
7.10 Equivalences for Outerjoins . . . . . . . . . . . . . . . . . . . . . 253
7.10.1 Outerjoin Simplification . . . . . . . . . . . . . . . . . . . 259
7.10.2 Generalized Outerjoin . . . . . . . . . . . . . . . . . . . . 260
7.11 Equivalences for Unary Grouping . . . . . . . . . . . . . . . . . . 262
7.11.1 An Elementary Fact about Grouping . . . . . . . . . . . . 262
7.11.2 Join . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
7.11.3 Left Outerjoin . . . . . . . . . . . . . . . . . . . . . . . . 273
7.11.4 Left Outerjoin with Default . . . . . . . . . . . . . . . . . 276
7.11.5 Full Outerjoin . . . . . . . . . . . . . . . . . . . . . . . . . 277
7.11.6 D-Join . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
7.11.7 Groupjoin . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
7.11.8 Intersection and Difference . . . . . . . . . . . . . . . . . 287
7.12 Eliminating Redundant Joins . . . . . . . . . . . . . . . . . . . . 287
7.13 Semijoin and Antijoin Reducer . . . . . . . . . . . . . . . . . . . 288
7.14 Outerjoin Simplification . . . . . . . . . . . . . . . . . . . . . . . 289
7.15 Correct and Complete Exploration of the Core Search Space . . . 289
7.15.1 The Core Search Space . . . . . . . . . . . . . . . . . . . 289
7.15.2 Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . 291
7.15.3 More Issues . . . . . . . . . . . . . . . . . . . . . . . . . . 299
7.16 Logical Algebra for Sequences . . . . . . . . . . . . . . . . . . . . 303
7.16.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 303
7.16.2 Algebraic Operators . . . . . . . . . . . . . . . . . . . . . 304
7.16.3 Equivalences . . . . . . . . . . . . . . . . . . . . . . . . . 307
7.16.4 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . 307
7.17 Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
7.18 ToDo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308

8 Declarative Query Representation 309

8.1 Calculus Representations . . . . . . . . . . . . . . . . . . . . . . 309
8.2 Datalog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
8.3 Tableaux Representation . . . . . . . . . . . . . . . . . . . . . . . 309
8.4 Monoid Comprehension . . . . . . . . . . . . . . . . . . . . . . . 309
8.5 Expressiveness . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
8.6 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
CONTENTS v

9 Translation and Lifting 311

9.1 Query Language to Calculus . . . . . . . . . . . . . . . . . . . . . 311
9.2 Query Language to Algebra . . . . . . . . . . . . . . . . . . . . . 311
9.3 Calculus to Algebra . . . . . . . . . . . . . . . . . . . . . . . . . 311
9.4 Algebra to Calculus . . . . . . . . . . . . . . . . . . . . . . . . . 311
9.5 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311

10 Query Equivalence, Containment, Minimization, and Factor-

ization 313
10.1 Set Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
10.1.1 Conjunctive Queries . . . . . . . . . . . . . . . . . . . . . 314
10.1.2 . . . with Inequalities . . . . . . . . . . . . . . . . . . . . . 316
10.1.3 . . . with Negation . . . . . . . . . . . . . . . . . . . . . . . 317
10.1.4 . . . under Constraints . . . . . . . . . . . . . . . . . . . . 317
10.1.5 . . . with Aggregation . . . . . . . . . . . . . . . . . . . . . 317
10.2 Bag Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
10.2.1 Conjunctive Queries . . . . . . . . . . . . . . . . . . . . . 317
10.3 Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
10.3.1 Path Expressions . . . . . . . . . . . . . . . . . . . . . . . 318
10.4 Minimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
10.5 Detecting common subexpressions . . . . . . . . . . . . . . . . . 319
10.5.1 Simple Expressions . . . . . . . . . . . . . . . . . . . . . . 319
10.5.2 Algebraic Expressions . . . . . . . . . . . . . . . . . . . . 319
10.6 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319

III Rewrite Techniques 321

11 Simple Rewrites 323

11.1 Simple Adjustments . . . . . . . . . . . . . . . . . . . . . . . . . 323
11.1.1 Rewriting Simple Expressions . . . . . . . . . . . . . . . . 323
11.1.2 Normal forms for queries with disjunction . . . . . . . . . 325
11.2 Deriving new predicates . . . . . . . . . . . . . . . . . . . . . . . 325
11.2.1 Collecting conjunctive predicates . . . . . . . . . . . . . . 325
11.2.2 Equality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
11.2.3 Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
11.2.4 Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . 327
11.2.5 ToDo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
11.3 Predicate Push-Down and Pull-Up . . . . . . . . . . . . . . . . . 329
11.4 Eliminating Redundant Joins . . . . . . . . . . . . . . . . . . . . 329
11.5 Distinct Pull-Up and Push-Down . . . . . . . . . . . . . . . . . . 329
11.6 Set-Valued Attributes . . . . . . . . . . . . . . . . . . . . . . . . 329
11.6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 329
11.6.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . 330
11.6.3 Query Rewrite . . . . . . . . . . . . . . . . . . . . . . . . 331
11.7 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
vi CONTENTS

12 View Merging 335

12.1 View Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
12.2 Simple View Merging . . . . . . . . . . . . . . . . . . . . . . . . . 335
12.3 Predicate Move Around (Predicate pull-up and push-down) . . . 336
12.4 Complex View Merging . . . . . . . . . . . . . . . . . . . . . . . 337
12.4.1 Views with Distinct . . . . . . . . . . . . . . . . . . . . . 337
12.4.2 Views with Group-By and Aggregation . . . . . . . . . . 338
12.4.3 Views in IN predicates . . . . . . . . . . . . . . . . . . . . 339
12.4.4 Final Remarks . . . . . . . . . . . . . . . . . . . . . . . . 339
12.5 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340

13 Quantifier treatment 341

13.1 Pseudo-Quantifiers . . . . . . . . . . . . . . . . . . . . . . . . . . 341
13.2 Existential quantifier . . . . . . . . . . . . . . . . . . . . . . . . . 342
13.3 Universal quantifier . . . . . . . . . . . . . . . . . . . . . . . . . . 342
13.4 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346

14 Unnesting Nested Queries 347

15 Optimizing Queries with Materialized Views 349

15.1 Conjunctive Views . . . . . . . . . . . . . . . . . . . . . . . . . . 349
15.2 Views with Grouping and Aggregation . . . . . . . . . . . . . . . 349
15.3 Views with Disjunction . . . . . . . . . . . . . . . . . . . . . . . 349
15.4 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349

16 Semantic Query Rewrite 351

16.1 Constraints and their impact on query optimization . . . . . . . 351
16.2 Semantic Query Rewrite . . . . . . . . . . . . . . . . . . . . . . . 351
16.3 Exploiting Uniqueness in Query Optimization . . . . . . . . . . . 352
16.4 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352

IV Plan Generation 353

17 Current Search Space and Its Limits 355

17.1 Plans with Outer Joins, Semijoins and Antijoins . . . . . . . . . 355
17.2 Expensive Predicates and Functions . . . . . . . . . . . . . . . . 355
17.3 Techniques to Reduce the Search Space . . . . . . . . . . . . . . 355
17.4 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355

18 Dynamic Programming-Based Plan Generation 357

18.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
18.2 Hypergraphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
18.3 CCPs: Csg-Cmp-Pairs for Hypergraphs . . . . . . . . . . . . . . 359
18.4 Neighborhood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
18.5 The CCP Enumerator BuEnumCppHyp . . . . . . . . . . . . . . . . 361
18.5.1 BuEnumCcpHyp . . . . . . . . . . . . . . . . . . . . . . . 362
18.5.2 EnumerateCsgRec . . . . . . . . . . . . . . . . . . . . . . 363
CONTENTS vii

18.5.3 EmitCsg . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364

18.5.4 EnumerateCmpRec . . . . . . . . . . . . . . . . . . . . . . 365
18.5.5 EmitCsgCmp . . . . . . . . . . . . . . . . . . . . . . . . . 365
18.5.6 Neighborhood Calculation . . . . . . . . . . . . . . . . . . 365
18.6 DPhyp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
18.7 Adding Selections . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
18.8 Adding Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
18.9 Adding Grouping . . . . . . . . . . . . . . . . . . . . . . . . . . . 366

19 Optimizing Queries with Disjunctions 367

19.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
19.2 Using Disjunctive or Conjunctive Normal Forms . . . . . . . . . 368
19.3 Bypass Plans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
19.4 Implementation remarks . . . . . . . . . . . . . . . . . . . . . . . 370
19.5 Other plan generators/query optimizer . . . . . . . . . . . . . . . 370
19.6 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371

20 Generating Plans for the Full Algebra 373

21 Generating DAG-structured Plans 375

22 Simplifying the Query Graph 377

22.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
22.2 On Optimizing Join Queries . . . . . . . . . . . . . . . . . . . . . 378
22.3 Graph Simplification Algorithm . . . . . . . . . . . . . . . . . . . 379
22.3.1 Simplifying the Query Graph . . . . . . . . . . . . . . . . 380
22.3.2 The Full Algorithm . . . . . . . . . . . . . . . . . . . . . . 382
22.3.3 Join Ordering Criterion . . . . . . . . . . . . . . . . . . . 383
22.3.4 Theoretical Foundation . . . . . . . . . . . . . . . . . . . 384
22.4 The Time/Quality Trade-Off . . . . . . . . . . . . . . . . . . . . 386

23 Deriving and Dealing with Interesting Orderings and Group-

ings 389
23.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
23.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . 390
23.2.1 Ordering . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
23.2.2 Grouping . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
23.2.3 Functional Dependencies . . . . . . . . . . . . . . . . . . . 393
23.2.4 Algebraic Operators . . . . . . . . . . . . . . . . . . . . . 393
23.2.5 Plan Generation . . . . . . . . . . . . . . . . . . . . . . . 394
23.3 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
23.4 Detailed Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 398
23.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
23.4.2 Determining the Input . . . . . . . . . . . . . . . . . . . . 399
23.4.3 Constructing the NFSM . . . . . . . . . . . . . . . . . . . 400
23.4.4 Constructing the DFSM . . . . . . . . . . . . . . . . . . . 403
23.4.5 Precomputing Values . . . . . . . . . . . . . . . . . . . . . 404
viii CONTENTS

23.4.6 During Plan Generation . . . . . . . . . . . . . . . . . . . 404

23.4.7 Reducing the Size of the NFSM . . . . . . . . . . . . . . . 404
23.4.8 Complex Ordering Requirements . . . . . . . . . . . . . . 408
23.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 409
23.6 Total Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
23.7 Influence of Groupings . . . . . . . . . . . . . . . . . . . . . . . . 411
23.8 Annotated Bibliography . . . . . . . . . . . . . . . . . . . . . . . 415

24 Cardinality and Cost Estimation 419

24.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
24.2 A First Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
24.2.1 Top-Most Cost Formula (Overall Costs) . . . . . . . . . . 422
24.2.2 Summation of Operator Costs . . . . . . . . . . . . . . . . 422
24.2.3 CPU Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
24.2.4 Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . 423
24.2.5 I/O Costs . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
24.2.6 Cardinality Estimates . . . . . . . . . . . . . . . . . . . . 425
24.3 A First Logical Profile and its Propagation . . . . . . . . . . . . 427
24.3.1 The Logical Profile . . . . . . . . . . . . . . . . . . . . . . 427
24.3.2 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . 428
24.3.3 Profile Propagation for Selection . . . . . . . . . . . . . . 430
24.3.4 Profile Propagation for Join . . . . . . . . . . . . . . . . . 436
24.3.5 Profile Propagation for Projection . . . . . . . . . . . . . 437
24.3.6 Profile Propagation for Division . . . . . . . . . . . . . . . 441
24.3.7 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
24.4 Approximation of a Set of Values . . . . . . . . . . . . . . . . . . 443
24.4.1 Approximations and Error Metrics . . . . . . . . . . . . . 443
24.4.2 Example Applications . . . . . . . . . . . . . . . . . . . . 444
24.5 Approximation with Linear Models . . . . . . . . . . . . . . . . . 445
24.5.1 Linear Models . . . . . . . . . . . . . . . . . . . . . . . . 445
24.5.2 Example Applications . . . . . . . . . . . . . . . . . . . . 449
24.5.3 Linear Models Under l2 . . . . . . . . . . . . . . . . . . . 456
24.5.4 Linear Models Under l∞ . . . . . . . . . . . . . . . . . . . 461
24.5.5 Linear Models Under lq . . . . . . . . . . . . . . . . . . . 464
24.5.6 Non-Linear Models under lq . . . . . . . . . . . . . . . . . 471
24.5.7 Multidimensional Models under lq . . . . . . . . . . . . . 472
24.6 Traditional Histograms . . . . . . . . . . . . . . . . . . . . . . . . 473
24.6.1 Bucketization . . . . . . . . . . . . . . . . . . . . . . . . . 474
24.6.2 Heuristics to Determine Bucket Boundaries . . . . . . . . 475
24.7 More on Q . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476
24.7.1 Properties of the Q-Error . . . . . . . . . . . . . . . . . . 476
24.7.2 Properties of Estimation Functions . . . . . . . . . . . . . 484
24.7.3 θ,q-Acceptability . . . . . . . . . . . . . . . . . . . . . . . 485
24.7.4 Testing θ,q-Acceptability for Buckets . . . . . . . . . . . . 486
24.7.5 From Buckets To Histograms . . . . . . . . . . . . . . . . 489
24.7.6 Q-Compression . . . . . . . . . . . . . . . . . . . . . . . . 498
24.8 One Dimensional Synopses . . . . . . . . . . . . . . . . . . . . . . 501
CONTENTS ix

24.8.1 Four Level Tree and Variants . . . . . . . . . . . . . . . . 501

24.8.2 Q-Histograms (Type I) . . . . . . . . . . . . . . . . . . . . 504
24.8.3 Q-Histogram (Type II) . . . . . . . . . . . . . . . . . . . . 504
24.9 Sketches For Counting The Number of Distinct Values . . . . . . 504
24.9.1 Linear Counting . . . . . . . . . . . . . . . . . . . . . . . 506
24.9.2 DvByKMinVal . . . . . . . . . . . . . . . . . . . . . . . . 506
24.9.3 Logarithmic Counting . . . . . . . . . . . . . . . . . . . . 507
24.9.4 SuperLogLog Counting . . . . . . . . . . . . . . . . . . . 508
24.9.5 HyperLogLog Counting . . . . . . . . . . . . . . . . . . . 511
24.9.6 DvByMinAvg . . . . . . . . . . . . . . . . . . . . . . . . . 511
24.9.7 DvByKMinAvg . . . . . . . . . . . . . . . . . . . . . . . . 512
24.9.8 Pointers to the Literature . . . . . . . . . . . . . . . . . . 513
24.10Multidimensional Synopsis . . . . . . . . . . . . . . . . . . . . . . 513
24.10.1 Introductory Example . . . . . . . . . . . . . . . . . . . . 514
24.10.2 Solving the Introductory Problem without 2-Dimensional
Synopsis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
24.10.3 Statistical Views . . . . . . . . . . . . . . . . . . . . . . . 516
24.10.4 Regular Partitioning: equi-width . . . . . . . . . . . . . . 517
24.10.5 Equi-Depth Histogram . . . . . . . . . . . . . . . . . . . . 517
24.10.6 2-Dimensional Synopsis based on SVD . . . . . . . . . . . 517
24.10.7 PHASED . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
24.10.8 MHIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
24.10.9 GENHIST . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
24.10.10HiRed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
24.10.11VI Histograms . . . . . . . . . . . . . . . . . . . . . . . . 517
24.10.12Grid Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
24.10.13More . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
24.11Iterative Selectivity Combination . . . . . . . . . . . . . . . . . . 518
24.12Maximum Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . 518
24.13Selected Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519
24.13.1 Exploiting and Augmenting Existing DBMS Data Struc-
tures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519
24.13.2 Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
24.13.3 Query Feedback . . . . . . . . . . . . . . . . . . . . . . . 522
24.13.4 Combining Data Summaries with Sampling . . . . . . . . 522
24.13.5 Wavelets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
24.13.6 Selectivity of String-Valued Attributes . . . . . . . . . . . 522
24.14Cost Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
24.14.1 Disk-based Joins . . . . . . . . . . . . . . . . . . . . . . . 522
24.14.2 Main Memory Joins . . . . . . . . . . . . . . . . . . . . . 522
24.14.3 Additional Pointers to the Literature . . . . . . . . . . . . 522

V Implementation 525

25 Architecture of a Query Compiler 527

25.1 Compilation process . . . . . . . . . . . . . . . . . . . . . . . . . 527
x CONTENTS

25.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527

25.3 Control Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
25.4 Memory Management . . . . . . . . . . . . . . . . . . . . . . . . 529
25.5 Tracing and Plan Visualization . . . . . . . . . . . . . . . . . . . 529
25.6 Driver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529
25.7 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529

26 Internal Representations 533

26.1 Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
26.2 Algebraic Representations . . . . . . . . . . . . . . . . . . . . . . 533
26.2.1 Graph Representations . . . . . . . . . . . . . . . . . . . . 534
26.2.2 Query Graph . . . . . . . . . . . . . . . . . . . . . . . . . 534
26.2.3 Operator Graph . . . . . . . . . . . . . . . . . . . . . . . 534
26.3 Query Graph Model (QGM) . . . . . . . . . . . . . . . . . . . . . 534
26.4 Classification of Predicates . . . . . . . . . . . . . . . . . . . . . 534
26.5 Treatment of Distinct . . . . . . . . . . . . . . . . . . . . . . . . 534
26.6 Query Analysis and Materialization of Analysis Results . . . . . 534
26.7 Query and Plan Properties . . . . . . . . . . . . . . . . . . . . . 535
26.8 Conversion to the Internal Representation . . . . . . . . . . . . . 537
26.8.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 537
26.8.2 Translation into the Internal Representation . . . . . . . . 537
26.9 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537

27 Details on the Phases of Query Compilation 539

27.1 Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 539
27.2 Semantic Analysis, Normalization, Factorization, Constant Fold-
ing, and Translation . . . . . . . . . . . . . . . . . . . . . . . . . 539
27.3 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
27.4 Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
27.5 Constant Folding . . . . . . . . . . . . . . . . . . . . . . . . . . . 542
27.6 Semantic analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 542
27.7 Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544
27.8 Rewrite I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549
27.9 Plan Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . 549
27.10Rewrite II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549
27.11Code generation . . . . . . . . . . . . . . . . . . . . . . . . . . . 549
27.12Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 550

28 Hard-Wired Algorithms 551

28.1 Hard-wired Dynamic Programming . . . . . . . . . . . . . . . . . 551
28.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 551
28.1.2 A plan generator for bushy trees . . . . . . . . . . . . . . 555
28.1.3 A plan generator for bushy trees and expensive selections 556
28.1.4 A plan generator for bushy trees, expensive selections and
functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 556
28.2 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556
CONTENTS xi

29 Rule-Based Algorithms 559

29.1 Rule-based Dynamic Programming . . . . . . . . . . . . . . . . . 559
29.2 Rule-based Memoization . . . . . . . . . . . . . . . . . . . . . . . 559
29.3 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559

30 Example Query Compiler 561

30.1 Research Prototypes . . . . . . . . . . . . . . . . . . . . . . . . . 561
30.1.1 AQUA and COLA . . . . . . . . . . . . . . . . . . . . . . 561
30.1.2 Black Dahlia II . . . . . . . . . . . . . . . . . . . . . . . . 561
30.1.3 Epoq . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 561
30.1.4 Ereq . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563
30.1.5 Exodus/Volcano/Cascade . . . . . . . . . . . . . . . . . . 564
30.1.6 Freytags regelbasierte System R-Emulation . . . . . . . . 566
30.1.7 Genesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567
30.1.8 GOMbgo . . . . . . . . . . . . . . . . . . . . . . . . . . . 569
30.1.9 Gral . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 572
30.1.10 Lambda-DB . . . . . . . . . . . . . . . . . . . . . . . . . . 575
30.1.11 Lanzelotte in short . . . . . . . . . . . . . . . . . . . . . . 575
30.1.12 Opt++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . 576
30.1.13 Postgres . . . . . . . . . . . . . . . . . . . . . . . . . . . . 576
30.1.14 Sciore & Sieg . . . . . . . . . . . . . . . . . . . . . . . . . 578
30.1.15 Secondo . . . . . . . . . . . . . . . . . . . . . . . . . . . . 578
30.1.16 Squiral . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 578
30.1.17 System R and System R∗ . . . . . . . . . . . . . . . . . . 580
30.1.18 Starburst and DB2 . . . . . . . . . . . . . . . . . . . . . . 580
30.1.19 Der Optimierer von Straube . . . . . . . . . . . . . . . . . 583
30.1.20 Other Query Optimizer . . . . . . . . . . . . . . . . . . . 584
30.2 Commercial Query Compiler . . . . . . . . . . . . . . . . . . . . 586
30.2.1 The DB 2 Query Compiler . . . . . . . . . . . . . . . . . 586
30.2.2 The Oracle Query Compiler . . . . . . . . . . . . . . . . . 586
30.2.3 The SQL Server Query Compiler . . . . . . . . . . . . . . 590

VI Selected Topics 591

31 Generating Plans for Top-N-Queries? 593

31.1 Motivation and Introduction . . . . . . . . . . . . . . . . . . . . . 593
31.2 Optimizing for the First Tuple . . . . . . . . . . . . . . . . . . . 593
31.3 Optimizing for the First N Tuples . . . . . . . . . . . . . . . . . . 593

32 Recursive Queries 595

33 Issues Introduced by OQL 597

33.1 Type-Based Rewriting and Pointer Chasing Elimination . . . . . 597
33.2 Class Hierarchies . . . . . . . . . . . . . . . . . . . . . . . . . . . 599
33.3 Cardinalities and Cost Functions . . . . . . . . . . . . . . . . . . 601
xii CONTENTS

34 Issues Introduced by XPath 603

34.1 A Naive XPath-Interpreter and its Problems . . . . . . . . . . . 603
34.2 Dynamic Programming and Memoization . . . . . . . . . . . . . 603
34.3 Naive Translation of XPath to Algebra . . . . . . . . . . . . . . . 603
34.4 Pushing Duplicate Elimination . . . . . . . . . . . . . . . . . . . 603
34.5 Avoiding Duplicate Work . . . . . . . . . . . . . . . . . . . . . . 603
34.6 Avoiding Duplicate Generation . . . . . . . . . . . . . . . . . . . 603
34.7 Index Usage and Materialized Views . . . . . . . . . . . . . . . . 603
34.8 Cardinalities and Costs . . . . . . . . . . . . . . . . . . . . . . . 603
34.9 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603

35 Issues Introduced by XQuery 605

35.1 Reordering in Ordered Context . . . . . . . . . . . . . . . . . . . 605
35.2 Result Construction . . . . . . . . . . . . . . . . . . . . . . . . . 605
35.3 Unnesting Nested XQueries . . . . . . . . . . . . . . . . . . . . . 605
35.4 Cardinalities and Cost Functions . . . . . . . . . . . . . . . . . . 605
35.5 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605

36 Outlook 607

A Query Languages? 609

A.1 Designing a query language . . . . . . . . . . . . . . . . . . . . . 609
A.2 SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
A.3 OQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
A.4 XPath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
A.5 XQuery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
A.6 Datalog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609

B Query Execution Engine (?) 611

C Glossary of Rewrite and Optimization Techniques 613

D Useful Formulas 619

Bibliography 620

Index 688

E ToDo 689
List of Figures

1.1 DBMS architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.2 Query interpreter . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Simple query interpreter . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Query compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5 Query compiler architecture . . . . . . . . . . . . . . . . . . . . . 8
1.6 Execution plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.7 Potential and actual search space . . . . . . . . . . . . . . . . . . 12
1.8 Generation vs. transformation . . . . . . . . . . . . . . . . . . . . 13

2.1 Relational algebra . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.2 Equivalences for the relational algebra . . . . . . . . . . . . . . . 18
2.3 (Simplified) Canonical translation of SQL to algebra . . . . . . . 19
2.4 Text book query optimization . . . . . . . . . . . . . . . . . . . . 20
2.5 Logical query optimization . . . . . . . . . . . . . . . . . . . . . . 21
2.6 Different join trees . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.7 Plans for example query (Part I) . . . . . . . . . . . . . . . . . . 28
2.8 Plans for example query (Part II) . . . . . . . . . . . . . . . . . . 29
2.9 Physical query optimization . . . . . . . . . . . . . . . . . . . . . 30
2.10 Plan for example query after physical query optimization . . . . 30

3.1 Query graph for example query of Section 2.1 . . . . . . . . . . . 32

3.2 Query graph shapes . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Illustrations for the IKKBZ Algorithm . . . . . . . . . . . . . . . 55
3.4 A query graph, its directed join graph, some spanning trees and
join trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.5 A query graph, its directed join tree, a spanning tree and its
problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.6 Search space with sharing under optimality principle . . . . . . . 63
3.7 Algorithm DPsize . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.8 Algorithm DPsub . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.9 Size of the search space for different graph structures . . . . . . . 74
3.10 Algorithm DPccp . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.11 Enumeration Example for DPccp . . . . . . . . . . . . . . . . . . 75
3.12 Sample graph to illustrate EnumerateCsgRec . . . . . . . . . . . 77
3.13 Call sequence for Figure 3.12 . . . . . . . . . . . . . . . . . . . . 77
3.14 Example of rule transformations (RS-1) . . . . . . . . . . . . . . 99

xiii
xiv LIST OF FIGURES

3.15 Encoding Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

3.16 Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
3.17 Tree-merge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
3.18 Algorithm UnrankDecomposition . . . . . . . . . . . . . . . . . . 110
3.19 Leaf-insertion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.20 A query graph, its tree, and its standard decomposition graph . . 111
3.21 Algorithm Adorn . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
3.22 A query graph, a join tree, and its encoding . . . . . . . . . . . . 121
3.23 Pseudo code for IDP-1 . . . . . . . . . . . . . . . . . . . . . . . . 124
3.24 Pseudocode for IDP-2 . . . . . . . . . . . . . . . . . . . . . . . . 125
3.25 Subroutine applicable-predicates . . . . . . . . . . . . . . . . 127
3.26 Subroutine construct-bushy-tree . . . . . . . . . . . . . . . . . 128
3.27 Subroutine extract-plan and its subroutine . . . . . . . . . . . 128
3.28 Impact of selectivity on the search space . . . . . . . . . . . . . . 134
3.29 Impact of relation sizes on the search space . . . . . . . . . . . . 134
3.30 Impact of parameters on the performance of heuristics . . . . . . 134
3.31 Impact of selectivities on probabilistic procedures . . . . . . . . . 135

4.1 Disk drive assembly . . . . . . . . . . . . . . . . . . . . . . . . . 138

4.2 Disk drive read request processing . . . . . . . . . . . . . . . . . 139
4.3 Time to read 100 MB from disk (depending on the number of
8 KB blocks read at once) . . . . . . . . . . . . . . . . . . . . . . 142
4.4 Time needed to read n random pages . . . . . . . . . . . . . . . . 144
4.5 Break-even point in fraction of total pages depending on page size145
4.6 Physical organization of a relational database . . . . . . . . . . . 148
4.7 Slotted pages and TIDs . . . . . . . . . . . . . . . . . . . . . . . 150
4.8 Various physical record layouts . . . . . . . . . . . . . . . . . . . 151
4.9 Clustered vs. non-clustered index . . . . . . . . . . . . . . . . . . 157
4.10 Illustration of seek cost estimate . . . . . . . . . . . . . . . . . . 190

5.1 Truth tables for two-valued logic . . . . . . . . . . . . . . . . . . 199

5.2 Simplification Rules . . . . . . . . . . . . . . . . . . . . . . . . . 200
5.3 Laws for two-valued logic . . . . . . . . . . . . . . . . . . . . . . 201
5.4 Comparison functions in the presence of NULL values . . . . . . 201
5.5 Truth tables for three-valued logic . . . . . . . . . . . . . . . . . 202
5.6 Interpretation of ⊥ values . . . . . . . . . . . . . . . . . . . . . . 202

7.1 Laws for Set Operations . . . . . . . . . . . . . . . . . . . . . . . 206

7.2 Laws for Bag Operations . . . . . . . . . . . . . . . . . . . . . . . 208
7.3 Decomposition of aggregate functions . . . . . . . . . . . . . . . . 215
7.4 Example for map and group operators . . . . . . . . . . . . . . . 222
7.5 Three possible representations of a bag . . . . . . . . . . . . . . . 236
7.6 Example for outerjoin reorderability (for strict q) . . . . . . . . . 253
7.7 Example for outerjoin reorderability (for non-strict q 0 ) . . . . . . 254
7.8 Example for outerjoin reorderability (for partially non-strict q0) . 254
7.9 Example for outerjoin associativity for strict q . . . . . . . . . . . 255
7.10 Example for outerjoin associativity for non-strict q 0 . . . . . . . . 256
LIST OF FIGURES xv

7.11 Example for outerjoin l-asscom for strict q . . . . . . . . . . . . . 256

7.12 Example for grouping and join . . . . . . . . . . . . . . . . . . . 264
7.13 Extended example for grouping and join . . . . . . . . . . . . . . 265
7.14 Example for Eqv. 7.113 . . . . . . . . . . . . . . . . . . . . . . . 270
7.15 Example relations . . . . . . . . . . . . . . . . . . . . . . . . . . 285
7.16 Join results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
7.17 Left- and right-hand sides . . . . . . . . . . . . . . . . . . . . . . 286
7.18 Transformation rules for assoc, l-asscom, and r-asscom . . . . . . 289
7.19 Core search space example . . . . . . . . . . . . . . . . . . . . . . 290
7.20 The complete search space . . . . . . . . . . . . . . . . . . . . . . 291
7.21 Algorithm DPsube . . . . . . . . . . . . . . . . . . . . . . . . . . 293
7.22 Calculating TES for simple operator trees . . . . . . . . . . . . . 295
7.23 Example showing the incompleteness of CD-A . . . . . . . . . . . 296
7.24 Calculating conflict rules for simple operator trees . . . . . . . . 297
7.25 Example showing the incompleteness of CD-B . . . . . . . . . . . 298
7.26 Conflict detection for unary and binary operators . . . . . . . . . 300
7.27 Example for Map Operator . . . . . . . . . . . . . . . . . . . . . 304
7.28 Examples for unary grouping and the groupjoin . . . . . . . . . . 306

11.1 Simplification rules for boolean expressions . . . . . . . . . . . . 326

11.2 Axioms for equality . . . . . . . . . . . . . . . . . . . . . . . . . . 326
11.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
11.4 Axioms for inequality . . . . . . . . . . . . . . . . . . . . . . . . 334

18.1 Sample hypergraph . . . . . . . . . . . . . . . . . . . . . . . . . . 358

18.2 Trace of algorithm for Figure ?? . . . . . . . . . . . . . . . . . . 363
18.3 Pseudocode for calcNeighborhood . . . . . . . . . . . . . . . . . 366

19.1 DNF plans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368

19.2 CNF plans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
19.3 Bypass plans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369

22.1 Runtimes for Different Query Graphs . . . . . . . . . . . . . . . . 379

22.2 Exemplary Simplification Steps for a Star Query . . . . . . . . . 380
22.3 Pseudo-Code for a Single Simplification Step . . . . . . . . . . . 381
22.4 The Full Optimization Algorithm . . . . . . . . . . . . . . . . . . 383
22.5 The Effect of Simplification Steps for a Star Query with 20 Re-
lations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
22.6 The Effect of Simplification Steps for a Grid Query with 20 Re-
lations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388

23.1 Propagation of orderings and groupings . . . . . . . . . . . . . . 394

23.2 Possible FSM for orderings . . . . . . . . . . . . . . . . . . . . . 396
23.3 Possible FSM for groupings . . . . . . . . . . . . . . . . . . . . . 397
23.4 Combined FSM for orderings and groupings . . . . . . . . . . . . 397
23.5 Possible DFSM for Figure 23.4 . . . . . . . . . . . . . . . . . . . 397
23.6 Preparation steps of the algorithm . . . . . . . . . . . . . . . . . 399
23.7 Initial NFSM for sample query . . . . . . . . . . . . . . . . . . . 401
xvi LIST OF FIGURES

23.8 NFSM after adding DF D edges . . . . . . . . . . . . . . . . . . . 402

23.9 NFSM after pruning artificial states . . . . . . . . . . . . . . . . 402
23.10Final NFSM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
23.11Resulting DFSM . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
23.12contains Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
23.13transition Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
23.14Plan generation for different join graphs, Simmen’s algorithm
(left) vs. our algorithm (middle) . . . . . . . . . . . . . . . . . . 409
23.15Memory consumption in KB for Figure 23.14 . . . . . . . . . . . 411
23.16Time requirements for the preparation step . . . . . . . . . . . . 414
23.17Space requirements for the preparation step . . . . . . . . . . . . 415

24.1 Overview of operations for cardinality and cost estimations . . . 420

24.2 Sample for range query result estimation under CVA and ESA. . 431
24.3 Calculating the lower bound DG ⊥ . . . . . . . . . . . . . . . . . . 440
24.4 Calculating the estimate for DG . . . . . . . . . . . . . . . . . . . 441
24.5 Example frequency density and cumulated frequency . . . . . . . 451
24.6 Cumulated frequency and its approximation . . . . . . . . . . . . 452
24.7 Q-error and plan optimality . . . . . . . . . . . . . . . . . . . . . 456
24.8 Algorithm for best linear approximation under l∞ . . . . . . . . 465
24.9 Algorithm finding best linear approximation under lq . . . . . . . 470
24.10Sample data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
24.11Q-compression, logb -based . . . . . . . . . . . . . . . . . . . . . . 498
24.12Binary Q-compression . . . . . . . . . . . . . . . . . . . . . . . . 500
24.13FLT example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 501
24.14FLT example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503
24.15Car database example . . . . . . . . . . . . . . . . . . . . . . . . 506
24.16Linear Counting . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
24.17Algorithm DvByKMinVal . . . . . . . . . . . . . . . . . . . . . . 507
24.18Algorithm LogarithmicCounting . . . . . . . . . . . . . . . . . . . 508
24.19Algorithm PCSA . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
24.20Filling M for LogLogCounting, SuperLogLogCounting, and Hy-
perLogLogCounting . . . . . . . . . . . . . . . . . . . . . . . . . 510
24.21SuperLogLog Counting . . . . . . . . . . . . . . . . . . . . . . . . 510
24.22Calculation of α̃ . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
24.23HyperLogLog Counting . . . . . . . . . . . . . . . . . . . . . . . 512
24.24DvByMinAvg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513
24.25DvByKMinAvg . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514
24.26Example for Equi-Depth Tree . . . . . . . . . . . . . . . . . . . . 517
24.27Sample B+ -Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . 519

25.1 The compilation process . . . . . . . . . . . . . . . . . . . . . . . 528

25.2 Class Architecture of the Query Compiler . . . . . . . . . . . . . 530
25.3 Control Block Structure . . . . . . . . . . . . . . . . . . . . . . . 531

27.1 Expression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 540

27.2 Expression hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . 541
LIST OF FIGURES xvii

27.3 Expression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 542

27.4 Query 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
27.5 Internal representation . . . . . . . . . . . . . . . . . . . . . . . . 545
27.6 An algebraic operator tree with a d-join . . . . . . . . . . . . . . 548
27.7 Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549

28.1 A sample execution plan . . . . . . . . . . . . . . . . . . . . . . . 552

28.2 Different join operator trees . . . . . . . . . . . . . . . . . . . . . 553
28.3 Bottom up plan generation . . . . . . . . . . . . . . . . . . . . . 555
28.4 A Dynamic Programming Optimization Algorithm . . . . . . . . 557

30.1 Beispiel einer Epoq-Architektur . . . . . . . . . . . . . . . . . . . 562

30.2 Exodus Optimierer Generator . . . . . . . . . . . . . . . . . . . . 564
30.3 Organisation der Optimierung . . . . . . . . . . . . . . . . . . . . 567
30.4 Ablauf der Optimierung . . . . . . . . . . . . . . . . . . . . . . . 570
30.5 Architektur von GOMrbo . . . . . . . . . . . . . . . . . . . . . . 571
30.6 a) Architektur des Gral-Optimierers; b) Operatorhierarchie nach
Kosten . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 572
30.7 Die Squiralarchitektur . . . . . . . . . . . . . . . . . . . . . . . . 579
30.8 Starburst Optimierer . . . . . . . . . . . . . . . . . . . . . . . . . 581
30.9 Der Optimierer von Straube . . . . . . . . . . . . . . . . . . . . . 583

33.1 Algebraic representation of a query . . . . . . . . . . . . . . . . . 597

33.2 A join replacing pointer chasing . . . . . . . . . . . . . . . . . . . 599
33.3 A Sample Class Hierarchy . . . . . . . . . . . . . . . . . . . . . . 600
33.4 Implementation of Extents . . . . . . . . . . . . . . . . . . . . . . 601
xviii LIST OF FIGURES
Preface

Goals
Primary Goals:

• book covers many query languages (at least SQL, OQL, XQuery (XPath))

• techniques should be represented as query language independent as pos-

sible

• book covers all stages of the query compilation process

• book completely covers fundamental issues

• book gives implementation details and tricks

Secondary Goals:

• book is thin

• book is not totally unreadable

• book separates concepts from implementation techniques

Organizing the material is not easy: The same topic pops up

• at different stages of the query compilation process and

• at different query languages

Acknowledgements
Introducer to query optimization: Günther von Bültzingsloewen
Peter Lockemann
First paper coauthor: Stefan Karl,
Coworkers: Alfons Kemper, Klaus Peithner, Michael Steinbrunn, Donald
Kossmann, Carsten Gerlhof, Jens Claussen,
Sophie Cluet, Vassilis Christophides, Georg Gottlob, V.S. Subramanian,
Sven Helmer, Birgitta König-Ries, Wolfgang Scheufele, Carl-Christian Kanne,
Thomas Neumann, Norman May, Matthias Brantner
Robin Aly

xix
LIST OF FIGURES 1

Discussions: Umesh Dayal, Dave Maier, Gail Mitchell, Stan Zdonik, Tamer
Özsu, Arne Rosenthal,
Don Chamberlin, Bruce Lindsay, Guy Lohman, Mike Carey, Bennet Vance,
Laura Haas, Mohan, CM Park,
Yannis Ioannidis, Götz Graefe, Serge Abiteboul, Claude Delobel Patrick
Valduriez, Dana Florescu, Jerome Simeon, Mary Fernandez, Christoph Koch,
Adam Bosworth, Joe Hellerstein, Paul Larson, Hennie Steenhagen, Harald
Schöning, Bernhard Seeger,
Encouragement: Anand Deshpande
Manuscript: Simone Seeger,
and many others to be inserted.
2 LIST OF FIGURES
Part I

Basics

3
Chapter 1

Introduction

1.1 General Remarks

Query languages like SQL or OQL are declarative. That is, they specify what
the user wants to retrieve and not how to retrieve it. It is the task of the
query compiler to generate a query evaluation plan (evaluation plan for short,
or execution plan or simply plan) for a given query. The query evaluation plan
(QEP) essentially is an operator tree with physical algebraic operators as nodes.
It is evaluated by the runtime system. Figure 1.6 shows a detailed execution
plan ready to be interpreted by the runtime system. Figure 28.1 shows an
abstraction of a query plan often used to explain algorithms or optimization
techniques.
The book tries to demystify query optimization and query optimizers. By
means of the multi-lingual query optimizer BD II, the most important aspects
of query optimizers and their implementation are discussed. We concentrate
not only on the query optimizer core (Rewrite I, Plan Generator, Rewrite II)
of the query compilation process but touch on all issues from parsing to code
generation and quality assurance.
We start by giving a two-module overview of a database management sys-
tem. One of these modules covers the functionality of the query compiler.
The query compiler itself involves several submodules. For each submodule we
discuss the features relevant for query compilation.

1.2 DBMS Architecture

Any database management system (DBMS) can be divided into two major
parts: the compile time system (CTS) and the runtime system (RTS). User
commands enter the compile time system which translates them into executa-
bles which are then interpreted by the runtime system (Fig. 1.1).
The input to the CTS are statements of several kinds, for example connect
to a database (starts a session), disconnect from a database, create a database,
drop a database, add/drop a schema, perform schema changes (add relations,
object types, constraints, . . . ), add/drop indexes, run statistics commands,
manually modify the DBMS statistics, begin of a transaction, end of a transac-

5
6 CHAPTER 1. INTRODUCTION

user command (e.g. query)

CTS
execution plan

RTS
result

Figure 1.1: DBMS architecture

Rewrite

query interpretation
calculus result

Figure 1.2: Query interpreter

tion, add/drop a view, update database items (e.g. tuples, relations, objects),
change authorizations, and state a query. Within the book, we will only be
concerned with the tiny last item.

1.3 Interpretation versus Compilation

There are two essential approaches to process a query: interpretation and com-
pilation.
The path of a query through a query interpreter is illustrated in Figure 1.2.
Query interpretation translates the query string into some internal representa-
tion that is mostly calculus-based. Optionally, some rewrite on this representa-
tion takes place. Typical steps during this rewrite phase are unnesting nested
queries, pushing selections down, and introducing index structures. After that,
the query is interpreted. A simple query interpreter is sketched in Figure 1.3.
The first function, interprete, takes a simple SQL block and extracts the dif-
ferent clauses, initializes the result R and calls eval. Then, eval recursively
evaluates the query by first producing the cross product of the entries in the
from clause. After all of them have been processed, the predicate is applied and
for those tuples where the where predicate evaluates to true, a result tuple is
constructed and added to the result set R. Obviously, the sketeched interpreter
is far from being efficient. A much better approach has been described by Wong
and Youssefi [918, 949].
Let us now discuss the compilation approach. The different steps are sum-
1.3. INTERPRETATION VERSUS COMPILATION 7

interprete(SQLBlock x) {

/* possible rewrites go here */

s := x.select();
f := x.from();
w := x.where();
R := ∅; /* result */
t := []; /* empty tuple */
eval(s, f , w, t, R);
return R;
}

eval(s, f , w, t, R) {

if(f .empty())
if(w(t))
R += s(t);
else
foreach(t0 ∈ first(f ))
eval(s, tail(f ), w, t ◦ t0 , R);
}

Figure 1.3: Simple query interpreter

Rewrite Rewrite / Transformation

query plan generation / code execution

calculus algebra
translation generation plan

Figure 1.4: Query compiler

marized in Figure 1.4. First, the query is rewritten. Again, unnesting nested
queries is a main technique for performance gains. Other rewrites will be dis-
cussed in Part ??. After the rewrite, the plan generation takes place. Here,
an optimal plan is constructed. Whereas typically rewrite takes place on a
calculus-based representation of the query, plan generation constructs an alge-
braic expression containing well-known operators like selection and join. Some-
times, after plan generation, the generated plan is refined: some polishing takes
place. Then, code is generated, that can be interpreted by the runtime system.
More specifically, the query execution engine—a part of the runtime system—
interpretes the query execution plan. Let us illustrate this. The following query
8 CHAPTER 1. INTRODUCTION

query

parsing CTS

abstract syntax tree

nfst
internal representation

rewrite I query
optimizer
internal representation

plan generation
internal representation

rewrite II
internal representation

code generation
execution plan

Figure 1.5: Query compiler architecture

is Query 1 of the now obsolete TPC-D benchmark [865].

SELECT RETURNFLAG, LINESTATUS,

SUM(QUANTITY) as SUM QTY,
SUM(EXTENDEDPRICE) as SUM EXTPR,
SUM(EXTENDEDPRICE * (1 - DISCOUNT)),
SUM(EXTENDEDPRICE * (1 - DISCOUNT)*
(1 + TAX)),
AVG(QUANTITY),
AVG(EXTENDEDPRICE),
AVG(DISCOUNT),
COUNT(*)
FROM LINEITEM
WHERE SHIPDDATE <= DATE ’1998-12-01’
GROUP BY RETURNFLAG, LINESTATUS
ORDER BY RETURNFLAG, LINESTATUS
1.4. REQUIREMENTS FOR A QUERY COMPILER 9

The CTS translates this query into a query execution plan. Part of the plan
is shown in Fig. 1.6. One rarely sees a query execution plan. This is the reason
why I included one. But note that the form of query execution plans differs
from DBMS to DBMS since it is (unfortunately) not standardized the way SQL
is. Most DBMSs can give the user abstract representations of query plans. It
is worth the time to look at the plans generated by some commercial DBMSs.
I do not expect the reader to understand the plan in all details. Some of
these details will become clear later. Anyway, this plan is given to the RTS
which then interprets it. Part of the result of the interpretation might look like
this:
RETURNFLAG LINESTATUS SUM QTY SUM EXTPR ...
A F 3773034 5319329289.68 ...
N F 100245 141459686.10 ...
N O 7464940 10518546073.98 ...
R F 3779140 5328886172.99 ...
This should look familar to you.
The above query plan is very simple. It contains only a few algebraic op-
erators. Usually, more algebraic operators are present and the plan is given in
a more abstract form that cannot be directly executed by the runtime system.
Fig. 2.10 gives an example of an abstracted more complex operator tree. We
will work with representations closer to this one.
A typical query compiler architecture is shown in Figure 1.5. The first com-
ponent is the parser. It produces an abstract syntax tree. This is not always the
case but this intermediate representation very much simplifies the task of fol-
lowing component. The NFST component performs several tasks. The first step
is normalization. This mainly deals with introducing new variables for subex-
pressions. Factorization and semantic analysis are performed during NFST.
Last, the abstract syntax tree is translated into the internal representation. All
these steps can typically be performed during a single path through the query
representation. Semantic analysis requires looking up schema definitions. This
can be expensive and, hence, the number of lookups should be minimized. Af-
ter NFST the core optimization steps rewrite I and plan generation take place.
Rewrite II does some polishing before code generation. These modules directly
correspond to the phases in Figure 1.4. They are typically further devided into
submodules handling subphases. The most prominent example is the prepara-
tion phase that takes place just before the actual plan generation takes place.
In our figures, we think of preparation as being part of the plan generation.

1.4 Requirements for a Query Compiler

Here are the main requirements for a query compiler:
1. Correctness

2. Completeness

3. Generate optimal plan (viz avoid the worst case)

10 CHAPTER 1. INTRODUCTION

(group
(tbscan
{segment ’lineitem.C4Kseg’ 0 4096}
{nalslottedpage 4096}
{ctuple ’lineitem.cschema’}
[ 20
LOAD_PTR 1
LOAD_SC1_C 8 1 2 // L_RETURNFLAG
LOAD_SC1_C 9 1 3 // L_LINESTATUS
LOAD_DAT_C 10 1 4 // L_SHIPDATE
LEQ_DAT_ZC_C 4 ’1998-02-09’ 1
] 2 1 // number of help-registers and selection-register
) 10 22 // hash table size, number of registers
[ // init
MV_UI4_C_C 1 0 // COUNT(*) = 0
LOAD_SF8_C 4 1 6 // L_QUANTITY
LOAD_SF8_C 5 1 7 // L_EXTENDEDPRICE
LOAD_SF8_C 6 1 8 // L_DISCOUNT
LOAD_SF8_C 7 1 9 // L_TAX
MV_SF8_Z_C 6 10 // SUM/AVG(L_QUANTITY)
MV_SF8_Z_C 7 11 // SUM/AVG(L_EXTENDEDPRICE)
MV_SF8_Z_C 8 12 // AVG(L_DISCOUNT)
SUB_SF8_CZ_C 1.0 8 13 // 1 - L_DISCOUNT
ADD_SF8_CZ_C 1.0 9 14 // 1 + L_TAX
MUL_SF8_ZZ_C 7 13 15 // SUM(L_EXTDPRICE * (1 - L_DISC))
MUL_SF8_ZZ_C 15 14 16 // SUM((...) * (1 + L_TAX))
] [ // advance
INC_UI4 0 // inc COUNT(*)
MV_PTR_Y 1 1
LOAD_SF8_C 4 1 6 // L_QUANTITY
LOAD_SF8_C 5 1 7 // L_EXTENDEDPRICE
LOAD_SF8_C 6 1 8 // L_DISCOUNT
LOAD_SF8_C 7 1 9 // L_TAX
MV_SF8_Z_A 6 10 // SUM/AVG(L_QUANTITY)
MV_SF8_Z_A 7 11 // SUM/AVG(L_EXTENDEDPRICE)
MV_SF8_Z_A 8 12 // AVG(L_DISCOUNT)
SUB_SF8_CZ_C 1.0 8 13 // 1 - L_DISCOUNT
ADD_SF8_CZ_C 1.0 9 14 // 1 + L_TAX
MUL_SF8_ZZ_B 7 13 17 15 // SUM(L_EXTDPRICE * (1 - L_DISC))
MUL_SF8_ZZ_A 17 14 16 // SUM((...) * (1 + L_TAX))
] [ // finalize
UIFC_C 0 18
DIV_SF8_ZZ_C 10 18 19 // AVG(L_QUANTITY)
DIV_SF8_ZZ_C 11 18 20 // AVG(L_EXTENDEDPRICE)
DIV_SF8_ZZ_C 12 18 21 // AVG(L_DISCOUNT)
] [ // hash program
HASH_SC1 2 HASH_SC1 3
] [ // compare program
CMPA_SC1_ZY_C 2 2 0
EXIT_NEQ 0
CMPA_SC1_ZY_C 3 3 0
])
Figure 1.6: Execution plan

4. Efficiency, generate the plan fast, do not waste memory

5. Graceful degradation
1.5. SEARCH SPACE 11

6. Robustness

First of all, the query compiler must produce correct query evaluation plans.
That is, the result of the query evaluation plan must be the result of the query
as given by the specification of the query language. It must also cover the
complete query language. The next issue is that an optimal query plan must
(should) be generated. However, this is not always that easy. That is why some
database researchers say that one must avoid the worst plan. Talking about
the quality of a plan requires us to fix the optimization goal. Several goals are
reasonable: We can optimize throughput, minimize response time, minimize
resource consumption (both, memory and CPU), and so on. A good query
compiler supports two optimization goals: minimize resource consumption and
minimize the time to produce the first tuple. Obviously, both goals cannot be
achieved at the same time. Hence, the query compiler must be instructed about
the optimization goal.
Irrespective of the optimization goal, the query compiler should produce the
query evaluation plan fast. It does not make sense to take 10 seconds to optimize
a query whose execution time is below a second. This sounds reasonable but
is not trivial to achieve. As we will see, the number of query execution plans
that are equivalent to a given query, i.e. produce the same result as the query,
can be very large. Sometimes, very large even means that not all plans can
be considered. Taking the wrong approach to plan generation will result in no
plan at all. This is the contrary of graceful degradation. Expressed positively,
graceful degradation means that in case of limited resources, a plan is generated
that may not be the optimal plan, but also not that far away from the optimal
plan.
Last, typical software quality criteria should be met. We only mentioned
robustness in our list, but others like maintainability must be met also.

1.5 Search Space

For a given query, there typically exists a high number of plans that are equiva-
lent to the plan. Not all of these plans are accessible. Only those plans that can
be generated by known optimization techniques (mainly algebraic equivalences)
can potentially be generated. Since this number may still be too large, many
query compilers restrict their search space further. We call the search space
explored by a query optimizer the actual search space. The potential search
space is the set of all plans that are known to be equivalent to the given query
by applying the state of the art of query optimization techniques. The whole
set of plans equivalent to a given query is typically unknown: we are not sure
whether all optimization techniques have been discovered so far. Figure 1.7
illustrates the situation. Note that we run into problems if the actual search
space is not a subset of the equivalent plans. Then the query compiler produces
wrong results. As we will see in several chapters of this book, some optimization
techniques have been proposed that produce plans that are not equivalent to
the original query.
12 CHAPTER 1. INTRODUCTION

equivalent
plans

actual
search space

potential
search space

Figure 1.7: Potential and actual search space

1.6 Generation versus Transformation

Two different approaches to plan generation can be distinguished:

• The transformation-based approach transforms one query execution plan

into another equivalent one. This can, for example, happen by applying
an algebraic equivalence to a query execution plan in order to yield a
better plan.

• The generic or synthetic approach produces a query execution plan by

assembling building blocks and adding one algebraic operator after the
other, until a complete query execution plan has been produced. Note
that in this approach only when all building blocks and algebraic opertors
have been introduced the internal representation can be executed. Before
that, no (complete) plan exists.

For an illustration see Figure 1.8.

A very important issue is how to explore the search space. Several well-
known approaches exist: A∗ , Branch-and-bound, greedy algorithms, hill-climbing,
dynamic programming, memoization, [205, 522, 523, 669]. These form the basis
for most of the plan generation algorithms.

1.7 Focus
In this book, we consider only the compilation of queries. We leave out many
special aspects like query optimization for multi-media database systems or
1.8. ORGANIZATION OF THE BOOK 13

a) Generative Approach b) Transformational Approach

Figure 1.8: Generation vs. transformation

multidatabase systems. These are just two omissions. We further do not con-
sider the translation of update statements which — especially in the presence
of triggers — can become quite complex. Furthermore, we assume the reader to
be familiar with the fundamentals of database systems [260, 476, 637, 696, 805]
and their implementation [397, 312]. Especially, knowledge on query execution
engines is required [341].
Last, the book presents a very personal view on query optimization. To
see other views on the same topic, I strongly recommend to read the literature
cited in this book and the references found therein. A good start are overview
articles, PhD theses, and books, e.g. [889, 318, 439, 440, 460, 534] [599, 602,
649, 819, 839, 873, 874].

1.8 Organization of the Book

The first part of the book is an introduction to the topic. It should give an
idea about the breadth and depth of query optimization. We first recapitulate
query optimization the way it is described in numerous text books on database
systems. There should be nothing really new here except for some pitfalls we
will point out. The Chapter 3 is devoted to the join ordering problem. This
has several reasons. First of all, at least one of the algorithms presented in
14 CHAPTER 1. INTRODUCTION

this chapter forms the core of every plan generator. The second reason is that
this problem allows to discuss some issues like search space sizes and problem
complexities. The third reason is that we do not have to delve into details.
We can stick to very simple (you might call them unrealistic) cost functions,
do not have to concern ourselves with details of the runtime system and the
like. Expressed positively, we can concentrate on some algorithmic aspects
of the problem. In Chapter 4 we do the opposite. The reader will not find
any advanced algorithms in this chapter but plenty of details on disks and cost
functions. The goal of the rest of the book is then to bring these issues together,
broaden the scope of the chapters, and treat problems not even touched by
them. The main issue not touched is query rewrite.
Chapter 2

Textbook Query Optimization

Almost every introductory textbook on database systems contains a section on

query optimization (or at least query processing) [260, 476, 637, 696, 805]. Also,
the two existing books on implementing database systems contain a section on
query optimization [397, 312]. In this chapter we give an excerpt1 of these
sections and subsequently discuss the problems with the described approach.
The bottom line will be that these descriptions of query optimization capture
the essence of it but contain pitfalls that need to be pointed out and gaps to
be filled.

2.1 Example Query and Outline

We use the following relations for our example query:

Student(SNo, SName, SAge, SYear)

Attend(ASNo, ALNo, AGrade)
Lecture(LNo, LTitle, LPNo)
Professor(PNo, PName)

Those attributes belonging to the key of the relations have been underlined.
With the following query we ask for all students attending a lecture by a
Professor called “Larson”.

select distinct s.SName

from Student s, Attend a, Lecture l, Professor p
where s.SNo = a.ASNo and a.ALNo = l.LNo
and l.LPNo = p.PNo and p.PName = ‘Larson’

The outline of the rest of the chapter is as follows. A query is typically

translated into an algebraic expression. Hence, we first review the relational
algebra and then discuss the translation process. Thereafter, we present the two
phases of textbook query optimization: logical and physical query optimization.
A brief discussion follows.
1
We do not claim to be fair to the above mentioned sections.

15
16 CHAPTER 2. TEXTBOOK QUERY OPTIMIZATION

2.2 Algebra
Let us briefly recall the standard definition of the most important algebra-
ic operators. Their inputs are relations, that is sets of tuples. Sets do not
contain duplicates. The attributes of the tuples are assumed to be simple (non-
decomposable) values. The most common algebraic operators are defined in
Fig. 2.1. Although the common set operations union (∪), intersection (∩), and
setdifference (\) belong to the relational algebra, we did not list them. Re-
member that ∪ and ∩ are both commutative and associative. \ is neither of
them. Further, for ∪ and ∩, two distributivity laws hold. However, since these
operations are not used in this section, we refer to Figure 7.1 in Section 7.1.1.
Before we can understand Figure 2.1, we must clarify some terms and no-
tations. For us, a tuple is a mapping from a set of attribute names (or at-
tributes for short) to their corresponding values. These values are taken from
certain domains. An actual tuple is denoted embraced by brackets. They
include a comma-separated list of the form attribute name, column and at-
tribute value as in [name: ‘‘Anton’’, age: 2]. If we have two tuples
with different attribute names, they can be concatenated, i.e. we can take the
union of their attributes. Tuple concatentation is denoted by ‘◦’. For exam-
ple [name: ‘‘Anton’’, age: 2] ◦ [toy: ‘‘digger’’] results in [name:
‘‘Anton’’, age: 2, toy: ‘‘digger’’]. Let A and A0 be two sets of at-
tributes where A0 ⊆ A holds. Further let t a tuple with schema A. Then, we can
project t on the attributes in A (written as t.A). The resulting tuple contains on-
ly the attributes in A0 ; others are discarded. For example, if t is the tuple [name:
‘‘Anton’’, age: 2, toy: ‘‘digger’’] and A = {name, age}, then t.A is
the tuple [name: ‘‘Anton’’, age: 2].
A relation is a set of tuples with the same attributes. The schema of a
relation is the set of attributes. For a relation R this is sometimes denoted by
sch(R), the schema of R. We denote it by A(R) and extend it to any algebraic
expression producing a set of tuples. That is, A(e) for any algebraic expression
is the set of attributes the resulting relation defines. Consider the predicate
age = 2 where age is an attribute name. Then, age behaves like a free variable
that must be bound to some value before the predicate can be evaluated. This
motivates us to often use the terms attribute and variable synonymously. In the
above predicate, we would call age a free variable. The set of free variables of
an expression e is denoted by F(e).
Sometimes it is useful to work with sequences of attributes in compari-
son predicates. Let A = ha1 , . . . , ak i and B = hb1 , . . . , bk i be two attribute
sequences. Then for any comparison operator θ ∈ {=, ≤, <, ≥, >, 6=}, the ex-
pression AθB abbreviates a1 θb1 ∧ a2 θb2 ∧ . . . ∧ ak θbk .
Often, a natural join is defined. Consider two relations R1 and R2 . Define
Ai := A(Ri ) for i ∈ {1, 2}, and A := A1 ∩ A2 . Assume that A is non-empty
and A = ha1 , . . . , an i. If A is non-empty, the natural join is defined as
R1 B R2 := ΠA1 ∪A2 (R1 Bp ρA:A0 (R2 ))
where ρA:A0 renames the attributes ai in A to a0i in A0 and the predicate p has
the form A = A0 , i.e. a1 = a01 ∧ . . . ∧ an = a0n .
2.3. CANONICAL TRANSLATION 17

σp (R) := {r|r ∈ R, p(r)}

ΠA (R) := {r.A|r ∈ R}
R1 A R2 := {r1 ◦ r2 |r1 ∈ R1 , r2 ∈ R2 }
R1 Bp R2 := σp (R1 A R2 )

Figure 2.1: Relational algebra

For our algebraic operators, several equivalences hold. They are given in
Figure 2.2. For them to hold, we typically require that the relations involved
have disjoint attribute sets. That is, we assume—even for the rest of the book—
that attribute names are unique. This is often achieved by using the notation
R.a for a relation R or v.a for a variable ranging over tuples with an attribute
a. Another possibility is to use the renaming operator ρ.
Some equivalences are not always valid. Their validity depends on whether
some condition(s) are satisfied or not. For example, Eqv. 2.4 requires F(p) ⊆ A.
That is, all attribute names occurring in p must be contained in the attribute set
A the projection retains: otherwise, we could not evaluate p after the projection
has been applied. Although all conditions in Fig. 2.2 are of this flavor, we will
see throughout the course of the book that more complex conditions exist.

2.3 Canonical Translation

The next question is how to translate a given SQL query into the algebra.
Since the relational algebra works on sets and not bags (multisets), we can only
translate SQL queries that contain a distinct. Further, we restrict ourselves EX
to flat queries not containing any subquery. Since negation, disjunction, aggre-
gation, and quantifiers pose further problems, we neglect them. Further, we
do not allow group by, order by, union, intersection, and except in our
query. Last, we allow only attributes in the select clause and not more complex
expressions.
Thus, the generic SQL query pattern we can translate into the algebra looks
as follows:

select distinct a1 , a2 , . . . , am
from R1 c1 , R2 c2 , . . . , Rn cn
where p

Here, the Ri are relation names and the ci are correlation names. The ai in
the select clause are attribute names (or expressions of the form ci .ai ) taken
from the relations in the from clause. The predicate p is assumed to be a
conjunction of comparisions between attributes or attributes and constants.
The translation process then follows the procedure described in Figure 2.3.
First, we construct an expression that produces the cross product of the entries
18 CHAPTER 2. TEXTBOOK QUERY OPTIMIZATION

σp1 ∧...∧pk (R) ≡ σp1 (. . . (σpk (R)) . . .) (2.1)

σp1 (σp2 (R)) ≡ σp2 (σp1 (R)) (2.2)
ΠA1 (ΠA2 (. . . (ΠAk (R)) . . .)) ≡ ΠA1 (R)
if Ai ⊆ Aj for i < j (2.3)
ΠA (σp (R)) ≡ σp (ΠA (R))
if F(p) ⊆ A (2.4)
(R1 A R2 ) A R3 ≡ R1 A (R2 A R3 ) (2.5)
(R1 Bp1,2 R2 ) Bp2,3 R3 ≡ R1 Bp1,2 (R2 Bp2,3 R3 )
if F(p1,2 ) ⊆ A(R1 ) ∪ A(R2 )
and F(p2,3 ) ⊆ A(R2 ) ∪ A(R3 ) (2.6)
R1 A R2 ≡ R2 A R1 (2.7)
R1 Bp R2 ≡ R2 Bp R1 (2.8)
σp (R1 A R2 ) ≡ σp (R1 ) A R2
if F(p) ⊆ A(R1 ) (2.9)
σp (R1 Bq R2 ) ≡ σp (R1 ) Bq R2
if F(p) ⊆ A(R1 ) (2.10)
ΠA (R1 A R2 ) ≡ ΠA1 (R1 ) A ΠA2 (R2 )
if A = A1 ∪ A2 , Ai ⊆ A(Ri ) (2.11)
ΠA (R1 Bp R2 ) ≡ ΠA1 (R1 ) Bp ΠA2 (R2 )
if F(p) ⊆ A, A = A1 ∪ A2 ,
and Ai ⊆ A(Ri ) (2.12)
σp (R1 θR2 ) ≡ σp (R1 )θσp (R2 )
where θ is any of ∪, ∩, \ (2.13)
ΠA (R1 ∪ R2 ) ≡ ΠA (R1 ) ∪ ΠA (R2 ) (2.14)
σp (R1 A R2 ) ≡ R1 Bp R2 (2.15)

Figure 2.2: Equivalences for the relational algebra

found in the from clause. The result is

((. . . ((R1 A R2 ) A R3 ) . . .) A Rn ).

Next, we add a selection with the where predicate:

σp ((. . . ((R1 A R2 ) A R3 ) . . .) A Rn ).

Last, we project on the attributes found in the select clause.

Πa1 ,...,an (σp ((. . . ((R1 A R2 ) A R3 ) . . .) A Rn )).

For our example query

2.3. CANONICAL TRANSLATION 19

1. Let R1 . . . Rk be the entries in the from clause of the query. Construct

the expression

R1 if k = 1
F :=
((. . . (R1 A R2 ) A . . .) A Rk ) else

2. The where clause is optional in SQL. Therefore, we distinguish the

cases that there is no where clause and that the where clause exists
and contains a predicate p. Construct the expression

F if there is no where clause
W :=
σp (F ) if the where clause contains p

3. Let s be the content of the select distinct clause. For the canonical
translation it must be of either ’*’ or a list a1 , . . . , an of attribute names.
Construct the expression

W if s = ’*’
S :=
Πa1 ,...,an (W ) if s = a1 , . . . , an

4. Return S.

Figure 2.3: (Simplified) Canonical translation of SQL to algebra

select distinct s.SName

from Student s, Attend a, Lecture l, Professor p
where s.SNo = a.ASNo and a.ALNo = l.LNo
and l.LPNo = p.PNo and p.PName = ‘Larson’

the result of the translation is

Πs.SN ame (σp (((Student[s] A Attend[a]) A Lecture[l]) A Prof essor[p]))

where p equals

s.SNo = a.ASNo and a.ALNo = l.LNo and l.LPNo = p.PNo and p.PName =
‘Larson’.

Note that we used the notation R[r] to say that a relation named R has the
correlation name r. During the course of the book we will be more precise
about the semantics of this notation and it will deviate from the one suggested
here. We will take r as a variable successively bound to the elements (tuples)
in R. However, for the purpose of this chapter it is sufficient to think of it
as associating a correlation name with a relation. The query is represented
graphically in Figure 2.7 (top).
20 CHAPTER 2. TEXTBOOK QUERY OPTIMIZATION

1. translate query into its canonical algebraic expression

2. perform logical query optimization
3. perform physical query optimization

Figure 2.4: Text book query optimization

2.4 Logical Query Optimization

Textbook query optimization takes place in two separate phases. The first phase
is called logical query optimization and the second physical query optimization.
Figure 2.4 lists all these steps together with the translation step. In this section
we discuss logical query optimization. The foundation for this step is formed
by the set of algebraic equivalences (see Figure 2.2). The set of algebraic equiv-
alences spans the potential search space for this step. Given an initial algebraic
expression—resulting from the translation of the given query—the algebraic
equivalences can be used to derive all algebraic expressions that are equivalent
to the initial algebraic expression. This set of all equivalent algebraic expres-
sions can be derived by applying the equivalences first to the initial expression
and then to all derived expressions until no new expression is derivable. There-
by, the algebraic equivalences can be applied in both directions: from left to
right and from right to left. Care has to be taken that the conditions attached
to the equivalences are obeyed.
Of course, whenever we find a new algebraic equivalence that could not
be derived from those already known, adding this equivalence increases our
potential search space. On the one hand, this has the advantage that in a larg-
er search space we may find better plans. On the other hand, it increases the
already large search space which might cause problems for its exploration. Nev-
ertheless, finding new equivalences is a well-established sport among database
researchers.
One remark on better plans. Plans can only be compared if costs can be at-
tached to them via some cost function. This is what happens in most industrial
strength query optimizers. However, at the level of logical algebraic expres-
sions adding precise costs is not possible: too many implementation details are
missing. These are added to the plan during the next phase called physical
query optimization. As a consequence, we are left with plans without costs.
The only thing we can do is to heuristically judge the effectiveness of applying
an equivalence from left to right or in the opposite direction. As always with
heuristics, the hope is that they work for most queries. However, it is typically
very easy to find counter examples where the heuristics do not result in the
best plan possible. (Again, best with respect to some metrics.) This finding can
be generalized: any query optimization that takes place in more than a single
phase risks missing the best plan. This is an important observation and we will
come back to this issue more than once.
2.4. LOGICAL QUERY OPTIMIZATION 21

1. break up conjunctive selection predicates

(Eqv. 2.1: →)

2. push down selections

(Eqv. 2.2: →), (Eqv. 2.9: →)

3. introduce joins
(Eqv. 2.15: →)

4. determine join order

Eqv. 2.8, Eqv. 2.6, Eqv. 2.5, Eqv. 2.7

5. introduce and push down projections

(Eqv. 2.3: ←), (Eqv. 2.4: →),
(Eqv. 2.11: →), (Eqv. 2.12: →)

Figure 2.5: Logical query optimization

After these words of warning let us continue to discuss textbook query

optimization. Logical query optimization requires the organization of all equiv-
alences into groups. Further, the equivalences are directed. That is, it is fixed
whether they are applied in a left to right or right to left manner. A directed
equivalence is called rewrite rule. The groups of rewrite rules are then suc-
cessively applied to the initial algebraic expression. Figure 2.5 describes the
different steps performed during logical query optimization. Associated with
each step is a set of rewrite rules that are applied to the input expression to
yield a result expression. The numbers correspond to the equivalences in Fig-
ure 2.2. A small arrow indicates the direction in which the equivalences are
applied.
The first step breaks up conjunctive selection predicates. The motivation
behind this step is that selections with simple predicates can be moved around
easier. The rewrite rule used in this step is Equivalence 2.1 applied from left to
right. For our example query Step 1 results in

Πs.SN ame (
σs.SN o=a.ASN o (
σa.ALN o=l.LN o (
σl.LP N o=p.P N o (
σp.P N ame=‘Larson0 (
((Student[s] A Attend[a]) A Lecture[l]) A Prof essor[p])))))

The query is represented graphically in Figure 2.7 (middle).

Step 2 pushes selections down the operator tree. The motivation here is to
reduce the number of tuples as early as possible such that subsequent (expen-
sive) operators have smaller input. Applying this step to our example query
yields:
22 CHAPTER 2. TEXTBOOK QUERY OPTIMIZATION

Πs.SN ame (
σl.LP N o=p.P N o (
σa.ALN o=l.LN o (
σs.SN o=a.ASN o (Student[s] A Attend[a])
ALecture[l])
A(σp.P N ame=‘Larson0 (Prof essor[p]))))

The query is represented graphically in Figure 2.7 (bottom).

Excursion In general, we might encounter problems when pushing down se-

lections. It may be the case that the order of the cross products is not well-suited
for pushing selections down. If this is the case, we must consider reordering cross
products during this step (Eqv. 2.7 and 2.5). To illustrate this point consider
the following example query.

select distinct s.SName

from Student s, Lecture l, Attend a
where s.SNo = a.ASNo and a.ALNo = l.LNo
and l.LTitle = ‘Databases I’

After translation and Steps 1 and 2 the algebraic expression looks like

Πs.SN ame (
σs.SN o=a.ASN o (
σa.ALN o=l.LN o (
(Student[s] A σl.LT itle=‘Databases I 0 (Lecture[l])) A Attend[a]))).

Neither of σs.SN o=a.ASN o and σa.ALN o=l.LN o can be pushed down further. Only
after reordering the cross products such as in

Πs.SN ame (
σs.SN o=a.ASN o (
σa.ALN o=l.LN o (
(Student[s] A Attend[a]) A σl.LT itle=‘Databases I 0 (Lecture[l]))))

can σs.SN o=a.ASN o be pushed down:

Πs.SN ame (
σa.ALN o=l.LN o (
σs.SN o=a.ASN o (Student[s] A Attend[a])
Aσl.LT itle=‘Databases I 0 (Lecture[l])))

This is the reason why in some textbooks reorder cross products before selec-
tions are pushed down [260]. In this appoach, reordering of cross products takes
into account the selection predicates that can possibly be pushed down to the
leaves and down to just prior a cross product. In any case, the Steps 2 and 4
are highly interdependent and there is no simple solution. 2
After this small excursion let us resume rewriting our main example query.
The next step to be applied is converting cross products to join operations (Step
3). The motivation behind this step is that the evaluation of cross products
2.4. LOGICAL QUERY OPTIMIZATION 23

is very expensive and results in huge intermediate results. For every tuple in
the left input an output tuple must be produced for every tuple in the right
input. A join operation can be implemented much more efficiently. Applying
Equivalence 2.15 from left to right to our example query results in

Πs.SN ame (
((Student[s] Bs.SN o=a.ASN o Attend[a])
Ba.ALN o=l.LN o Lecture[l])
Bl.LP N o=p.P N o (σp.P N ame=‘Larson0 (Prof essor[p])))

The query is represented graphically in Figure 2.8 (top).

The next step is really tricky and involved: we have to find an optimal
order for evaluating the joins. The join’s associativity and commutativity
gives us plenty of alternative (equivalent) evaluation plans. For our rather
simple query Figure 2.6 lists some of the possible join orders where we left
out the join predicates and used the single letter correlation names to denote
the relations to be joined. Only p abbreviates the more complex expression
σp.P N ame=‘Larson0 (Prof essor[p]). The edges show how plans can be derived
from other plans by applying commutativity (c) or associativity (a).
Unfortunately, we cannot ignore the problem of finding a good join order.
It has been shown that the order in which joins are evaluated has an enormous
influence on the total evaluation cost of a query. Thus, it is an important
problem. On the other hand, the problem is really tough. Most join ordering
problems turn out to be NP-hard. As a consequence, many different heuristics
and cost-based algorithms have been invented. They are discussed in depth in
Chapter 3. There we will also find examples showing how important (in terms
of costs) the right choice of the join order is.
To continue with our example query, we use a very simple heuristics: among
all possible joins select the one first that produces the smallest intermediate
result. This can be motivated as follows. In our current algebraic expression,
the first join to be executed is

Student[s] Bs.SN o=a.ASN o Attend[a].

All students and their attendances to some lecture are considered. The result
and hence the input to the next join will be very big. On the other hand, if there
is only one professor named Larson, the output of σp.P N ame=‘Larson0 (Prof essor[p])
is a single tuple. Joining this single tuple with the relation Lecture results in
an output containing one tuple for every lecture taught by Larson. For a large
university, this will be a small subset of all lectures. Continuing this line, we
get the following algebraic expression:

Πs.SN ame (
((σp.P N ame=‘Larson0 (Prof essor[p])
Bp.P N o=l.LP N o Lecture[l])
Bl.LN o=a.ALN o Attend[a])
Ba.ASno=s.SN o Student[s])

The query is represented graphically in Figure 2.8 (middle).

24 CHAPTER 2. TEXTBOOK QUERY OPTIMIZATION

The last step minimizes intermediate results by projecting out irrelevant

attributes. An attribute is irrelevant, if it is not used further up the operator
EX tree. When pushing down projections, we only apply them just before a pipeline
breaker [341]. The reason is that for pipelined operators like selection, elimi-
nating superfluous attributes does not gain much. The only pipeline breaker
occurring in our plan is the join operator. Hence, before a join is applied, we
project on the attributes that are further needed. The result is

Πs.SN ame (
Πa.ASN o (
Πl.LN O (
Πp.P N o (σp.P N ame=‘Larson0 (Prof essor[p]))
Bp.P N o=l.LP N o
Πl.LP no,l.LN o (Lecture[l]))
Bl.LN o=a.ALN o
Πa.ALN o,a.ASN o (Attend[a]))
Ba.ASno=s.SN o
Πs.SN o,s.SN ame (Student[s]))

This expression is represented graphically in Figure 2.8 (bottom).

2.5 Physical Query Optimization

Physical query optimization adds more information to the logical query eval-
uation plan. First, there exist many different ways to access the data stored
in a database. One possibility is to scan a relation to find the relevant tuples.
Another alternative is to use an index to access only the relevant parts. If an
unclustered index is used, it might be beneficial to sort the tuple identifiers
(TIDs2 ) to turn otherwise random disk accesses into sequential accesses. Since
there is a multitutude of possibilities to access data, this topic is discussed in
depth in Chapter 4. Second, the algebraic operators used in the logical plan
may have different alternative implementations. The most prominent exam-
ple is the join operator that has many different implementations: simple nested
loop join, blockwise nested loop join, blockwise nested loop join with in-memory
hash table, index nested loop join, hybrid hash join, sort merge join, bandwidth
join, special spatial joins, set joins, and structural joins. Most of these join im-
plementations can be applied only in certain situations. Most algorithms only
implement equi-joins where the join predicate is a conjunction of simple equal-
ities. Further, all the implementations differ in cost and robustness. But also
other operators like grouping may have alternative implementations. Typically,
for these operators exist sort-based and hash-based alternatives. Third, some
operators require certain properties for their input streams. For example, a sort
merge join requires its input to be sorted on the join attributes occurring in
the equalities of the join predicate. These attributes are called join attributes.
The sortedness property can be enforced by a sort operator. The sort operator
2
Sometimes TIDs are called RIDs (Row Identifiers).
2.6. DISCUSSION 25

is thus an enforcer since it makes sure that the required property holds. As we
will see, properties and enforcers play a crucial role during plan generation.
If common subexpressions are detected at the algebraic level, it might be
beneficial to compute them only once and store the result. To do so, a tmp
operator must be introduced. Later on, we will see more of these operators
that materialize (partial) intermediate results in order to avoid the same com-
putation to be performed more than once. An alternative is to allow QEPs
which are DAGs and not merely trees (see Section ??).
Physical query optimization is concerned with all the issues mentioned
above. The outline of it is given in Figure 2.9. Let us demonstrate this for
our small example query. Let us assume that there exists an index on the name
of the professors. Then, instead of scanning the whole professor relation, it
is beneficial to use the index to retrieve only those professors named Larson.
Further, since a sort merge join is very robust and not the slowest alternative,
we choose it as an implementation for all our join operations. This requires that
we sort the inputs to the join operator on the join attributes. Since sorting is
a pipeline breaker, we introduce it between the projections and the joins. The
resulting plan is

Πs.SN ame (
Sorta.ASN o (Πa.ASN o (
Sortl.LN o (Πl.LN O (
Sortp.P N o (Πp.P N o (IdxScanp.P N ame=‘Larson0 (Prof essor[p])))
Bsmj
p.P N o=l.LP N o
Sortl.LP N o (Πl.LP no,l.LN o (Lecture[l])))
Bsmj
l.LN o=a.ALN o
Sorta.ALN o (Πa.ALN o,a.ASN o (Attend[a]))))
Bsmj
a.ASno=s.SN o
Sorts.SN o (Πs.SN o,s.SN ame (Student[s])))

where we annotated the joins with smj to indicate that they are sort merge
joins. The sort operator has the attributes on which to sort as a subscript. We
cheated a little bit with the notation of the index scan. The index is a physical
entity stored in the database. An index scan typically allows to retrieve the
TIDs of the tuples qualifying the predicate. If this is the case, another access
to the relation itself is necessary to fetch the relevant attributes (p.PNo in
our case) from the qualifying tuples of the relation. This issue is rectified in
Chapter 4. The plan is shown as an operator graph in Figure 2.10.

2.6 Discussion
This chapter left open many interesting issues. We took it for granted that the
presentation of a query is an algebraic expression or operator tree. Is this really
true? We have been very vague about ordering joins and cross products. We
only considered queries of the form select distinct. How can we assure correct
duplicate treatment for select all? We separated query optimization into two
distinct phases: logical and physical query optimization. Any separation into
26 CHAPTER 2. TEXTBOOK QUERY OPTIMIZATION

different phases results in the danger of not producing an optimal plan. Logical
query optimization turned out to be a little difficult: pushing selections down
and reordering joins are mutually interdependent. How can we integrate these
steps into a single one and thereby avoid the problem mentioned? Further, our
logical query optimization was not cost based and cannot be: too much infor-
mation is still missing from the plan to associate precise costs with a logical
algebraic expression. How can we integrate the phases? How can we determine
the costs of a plan? We covered only a small fraction of SQL. We did not discuss
disjunction, negation, union, intersection, except, aggregate functions, group-
by, order-by, quantifiers, outer joins, and nested queries. Furthermore, how
about other query languages like OQL, XPath, XQuery? Further, enhance-
ments like materialized views exist nowadays in many commercial systems.
How can we exploit them beneficially? Can we exploit semantic information?
Is our exploitation of index structures complete? What happens if we encounter
NULL-values? Many questions and open issues remain. The rest of the book
is about filling these gaps.
2.6. DISCUSSION 27

B B B
B
B p c B p c p B a
B B
B l l B l B
p l s a
s a c s a c s a
c
B c
B B B
B p p
c B c p B a B B
B l
l B l B p l a s
a s a a s a a s
c
B B B
B p B
c B p c p B
B s B B
s B s B
l p a s
a l a a l a a l
c
B B B
c
B
B s c B s c s B
B B c
B p p B p B
l p s a
a l c a l c a l

B B B B
B s c B s c s B B B
B p p B p B
c a s p l
l a a l a l a
c
B B B B
B p c B p c p B B B
B s s B s B a s l p
l a a l a l a
c
B B B B
B s c B s c s B a B B
B a a B a B s a l p
l p c l p cl p
c
B B B B
B s c B s c s B a B B
B a a B a B s a p l
p l p l p l

Figure 2.6: Different join trees

28 CHAPTER 2. TEXTBOOK QUERY OPTIMIZATION

Πs.SN ame

σs.SN o = a.ASN o ∧ a.ALN o = l.LN o ∧

l.LP N o = p.P N o ∧ p.P N ame = 0 Larson0
A

Student[s] Attend[a] Lecture[l] Professor[p]

Πs.SN ame

σs.SN o = a.ASN o

σa.ALN o = l.LN o

σl.LP N o = p.P N o

σp.P N ame = 0 Larson0

Student[s] Attend[a] Lecture[l] Professor[p]

Πs.SN ame

σl.P N o = p.P N o

A
σa.ALN o = l.LN o σp.P N ame = 0 Larson0

A Professor[p]

σs.SN o = a.ASN o Lecture[l]

Student[s] Attend[a]

Figure 2.7: Plans for example query (Part I)

2.6. DISCUSSION 29

Πs.SN ame

Bl.P N o = p.P N o

Ba.ALN o = l.LN o σp.P N ame = 0 Larson0

Bs.SN o =a.ASN o Lecture[l] Professor[p]

Student[s] Attend[a]

Πs.SN ame

Ba.ASN o = s.SN o

Bl.LN o = a.ALN o Student[s]

Bp.P N o = l.LP N o Attend[a]

σp.P N ame = 0 Larson0 Lecture[l]

Professor[p]

Πs.SN ame

Ba.ASN o = s.SN o

Πa.ASN o Πs.SN o,s.SN ame

Bl.LN o Student[s]

Πl.LN o Πa.ALN o,a.ASN o

Bp.P N o = l.LP N o Attend[a]

Πp.P N o Πl.LP N o,l.LN o

σp.P N ame = 0 Larson0 Lecture[l]

Professor[p]

Figure 2.8: Plans for example query (Part II)

30 CHAPTER 2. TEXTBOOK QUERY OPTIMIZATION

1. introduce index accesses

2. choose implementations for algebraic operators
3. introduce physical operators (sort, tmp)

Figure 2.9: Physical query optimization

Πs.SN ame

Bsmj
a.ASno = s.SN o

Sorta.ASN o Sorts.SN o

Πa.ASN o Πs.SN o,s.SN ame

Bsmj
l.LN o=a.ALN o Student[s]

Sortl.LN o Sorta.ALN o

Πl.LN o Πa.ALN o,a.ASN o

Bsmj
p.P N o=l.LP N o Attend[a]

Sortp.P N o Sortl.LP N o

Πp.P N o Πl.LP N o,l.LN o

IdxScanp.P N ame=0 Larson0 Lecture[l]

Professor[p]

Figure 2.10: Plan for example query after physical query optimization
Chapter 3

Join Ordering

The problem of join ordering is a very restricted and — at the same time —
a very complex one. We have touched this issue while discussing logical query
optimization in Chapter 2. Join ordering is performed in Step 4 of Figure 2.5.
In this chapter, we simplify the problem of join ordering by not considering du-
plicates, disjunctions, quantifiers, grouping, aggregation, or nested queries. Ex-
pressed positively, we concentrate on conjunctive queries with simple and cheap
join predicates. What this exactly means will become clear in the next section.
Subsequent sections discuss different algorithms for solving the join ordering
problem. Finally, we take a look at the structure of the search space. This is
important if different join ordering algorithms are compared via benchmarks.
If the wrong parameters are chosen, benchmark results can be misleading.
The algorithms of this chapter form the core of every plan generator.

3.1 Queries Considered

A conjunctive query is one whose where clause contains a (complex) predicate
which in turn is a conjunction of (simple) predicates. Hence, a conjunctive
query involves only and and no or or not operations. A simple predicate is of
the form e1 θe2 where θ ∈ {=, 6=, <, >, ≤, ≥} is a comparison operator and the ei
are simple expressions in attribute names possibly containing some simple and
cheap arithmetic operators. By cheap we mean that it is not worth applying
extra optimization techniques. In this chapter, we restrict simple predicates
even further to the form A = B for attributes A and B. A and B must also
belong to different relations such that every simple predicate in this chapter
is a join predicate. There are two reasons for this restriction. First, the most
efficient join algorithms rely on the fact that the join predicate is of the form
A = B. Such joins are called equi-joins. Any other join is called a non-equi-
join. Second, in relational systems joins on foreign key attributes of one relation
and key attributes of the other relation are very common. Other joins are rare.
A base relation is a relation that is stored (explicitly) in the database. For
the rest of the chapter, let Ri (1 ≤ i ≤ n) be n relations. These relations
can be base relations but do not necessarily have to be. They could also be
base relations to which predicates have already been supplied, e.g. as a result

31
32 CHAPTER 3. JOIN ORDERING

s.SNo = a.ASNo
Student Attend

a.ALNo = l.LNo

Professor Lecture
l.LPNo = p.PNo

p.PName = ’Larson’

Figure 3.1: Query graph for example query of Section 2.1

of applying the first three steps of logical query optimization.

Summarizing, the queries we consider can be expressed in SQL as
select distinct *
from R1 ,. . . ,Rn
where p
where p is a conjunction of simple join predicates with attributes from exactly
two relations. The latter restriction is not really necessary for the algorithms
presented in this chapter but simplifies the exposition.

3.1.1 Query Graph

A query graph is a convenient representation of a query. It is an undirected
graph with nodes R1 , . . . , Rn . For every simple predicate in the conjunction P
whose attributes belong to the relations Ri and Rj , we add an edge between
Ri and Rj . This edge is labeled by the simple predicate. From now on, we
denote the join predicate connecting Ri and Rj by pi,j . In general, pi,j can be
a conjunction of simple join predicates connecting Ri and Rj .
If query graphs are used for more than join ordering, selections need to be
represented. This is done by self-edges from the relation to which the selection
applies to itself. For the example query of Chapter 2.6, Figure 3.1 contains the
according query graph.
Query graphs can have many different shapes. The shapes that play a
certain role in query optimization and the evaluation of join ordering algorithms
are shown in Fig. 3.2. The query graph classes relevant for this chapter are chain
queries, star queries, tree queries, cyclic queries and clique queries. Note that
these classes are not disjoint and that some classes are subsets of other classes.
EX
In this chaper, we only treat connected query graphs. These can be evalu-
ated without cross products.

Excursion In general, the query graph is a hypergraph [875] as the following

example shows.
3.1. QUERIES CONSIDERED 33

chain queries star queries tree query

cyclic query cycle queries grid query clique queries

Figure 3.2: Query graph shapes

select *
from R1, R2, R3, R4
where f(R1.a, R2.a,R3.a) = g(R2.b,R3.b,R4.b)

3.1.2 Join Tree

A join tree is an algebraic expression in relation names and join operators.
Sometimes, cross products are allowed, too. A cross product is the same as a
join operator with true as its join predicate. A join tree has its name from its
graph representation. There, a join tree is a binary tree whose leaf nodes are
the relations and whose inner nodes are joins (and possibly cross products).
The edges represent the input/output relationship. Examples of join trees have
been shown in Figure 2.6.
Join trees fall into different classes. The most important classes are left-deep
trees, right-deep trees, zig-zag trees, and bushy trees. Left-deep trees are join
trees where every join has one of the relations Ri as its right input. Right-deep
trees are defined analogously. In zig-zag trees at least one input of every join
is a relation Ri . The class of zig-zag trees contains both left-deep and right-
deep trees. For bushy trees no restriction applies. Hence, the class of bushy
trees contains all of the above three classes. The roots of these notions date
back to the paper by Selinger et al. [772], where the search space of the query
optimizer was restricted to left-deep trees. There are two main reasons for this
restriction. First, only one intermediate result is generated at any time during
query evaluation. Second, the number of left-deep trees is far less than the
number of e.g. bushy trees. The other classes were then added later by other
researchers whenever they found better join trees in them. The different classes
are illustrated in Figure 2.6. From left to right, the columns contain left-deep,
zig-zag, right-deep, and bushy trees.
34 CHAPTER 3. JOIN ORDERING

Left-deep trees directly correspond to an ordering (i.e. a permutation) of

the relations. For example, the left-deep tree

((((R2 B R3 ) B R1 ) B R4 ) B R5 )

directly corresponds to the permutation R2 , R3 , R1 , R4 , R5 . It should be clear

that there is a one-to-one correspondence between permutations and left-deep
join trees. We will also use the term sequence of relations synonymously. The
notion of join ordering goes back to the times where only left-deep trees were
considered and, hence, producing an optimal join tree was equivalent to opti-
mally ordering the joins, i.e. determining a permutation with lowest cost.
Left-deep, right-deep, and zig-zag trees can be classed under the general
term linear trees. Sometimes, the term linear trees is used synonymously for
left-deep trees. We will not do so. Join trees are sometimes called operator trees
or query evaluation plans. Although this is not totally wrong, these terms have
a slightly different connotation. Operator trees typically contain more than
only join operators. Query evaluation plans (QEPs or plans for short) typically
have more information from physical query optimization associated with them.

3.1.3 Simple Cost Functions

In order to judge the quality of join trees, we need a cost function that associates
a certain positive cost with each join tree. Then, the task of join ordering is to
find among all equivalent join trees the join tree with lowest associated costs.
One part of any cost function are cardinality estimates. They are based on
the cardinalities of the relations, i.e. the number of tuples contained in them.
For a given relation Ri , we denote its cardinality by |Ri |.
Then, the cardinality of intermediate results must be estimated. This is
done by introducing the notion of join selectivity. Let pi,j be a join predicate
between relations Ri and Rj . The selectivity fi,j of pi,j is then defined as

|Ri Bpi,j Rj |
fi,j =
|Ri | ∗ |Rj |

This is the number of tuples in the join’s result divided by the number of tuples
in the Cartesian Product between Ri and Rj . If fi,j is 0.1, then only 10% of
all tuples in the Cartesian Product survive the predicate pi,j . Note that the
selectivity is always a number between 0 and 1 and that fi,j = fj,i . We use an
f and not an s, since the selectivity of a predicate is often called filter factor .
Besides the relation’s cardinalities, the selectivities of the join predicates
pi,j are assumed to be given as input to the join ordering algorithm. Therefore,
we can compute the output cardinality of a join Ri Bpi,j Rj , as

|Ri Bpi,j Rj | = fi,j |Ri ||Rj |

From this it becomes clear that if there is no join predicate for two relations
Ri and Rj , we can assume a join predicate true and associate a selectivity of
1 with it. The output cardinality is then the cardinality of the cross product
3.1. QUERIES CONSIDERED 35

between Ri and Rj . We also define fi,i = 1 for all 1 ≤ i ≤ n. This allows us to

keep subsequent formulas simple.
We now need to extend our cardinality estimation to join trees. This can be
done by recursively applying the above formula. Consider a join tree T joining
two join trees T1 and T2 , i.e. T = T1 B T2 . Then, the result cardinality |T | can
be calculated as follows. If T is a leaf Ri , then |T | := |Ri |. Otherwise,
Y
|T | = ( fi,j ) |T1 | |T2 |.
Ri ∈T1 ,Rj ∈T2

Note that this formula assumes that the selectivities are independent of each
other. Assuming independence is common but may be very misleading. More
on this issue can be found in Chapter ??. Nevertheless, we assume independence
and stick to the above formula.
For sequences of joins we can give a simple cardinality definition. Let s =
R1 , . . . , Rn be a sequence of relations. Then
n Y
Y k
|s| = ( fi,k |Rk |).
k=1 i=1

Given the above, a query graph alone is not really sufficient for the speci-
fication of a join ordering problem: cardinalities and selectivities are missing.
On the other hand, from a complete list of cardinalities and selectivities we can
derive the query graph. Obviously, the following defines a chain query with
query graph R1 − − − R2 − − − R3 :

|R1 | = 10
|R2 | = 100
|R3 | = 1000
f1,2 = 0.1
f2,3 = 0.2

In all examples, we assume for all i and j for which fi,j is not given that there
is no join predicate and hence fi,j = 1.
We now come to cost functions. The first cost function we consider is called
Cout . For a join tree T , Cout (T ) is the sum of all output cardinalities of all joins
in T . Recursively, we can define Cout as

0 if T is a single relation
Cout (T ) =
|T | + Cout (T1 ) + Cout (T2 ) if T = T1 B T2

From a theoretial point of view, Cout has many interesting properties: it is

symmetric, it has the ASI property, and it can be applied to an expression of
the logical algebra. From a practical point of view, however, it is rarely applied
(yet).
In real cost functions, the cardinalities only serve as input to more complex
formulas capturing the costs of a join implementation. Since real cost functions
36 CHAPTER 3. JOIN ORDERING

are too complex for this section, we stick to simple cost functions proposed by
Krishnamurthy, Boral, and Zaniolo [512]. They argue that these cost functions
are appropriate for main memory database systems. For the three different
join implementations nested loop join (nlj), hash join (hj), and sort merge join
(smj), they give the following cost functions:

Cnlj (e1 Bp e2 ) = |e1 ||e2 |

Chj (e1 Bp e2 ) = h|e1 |
Csmj (e1 Bp e2 ) = |e1 |log(|e1 |) + |e2 |log(|e2 |)

where ei are join trees and h is the average length of the collision chain in the
hash table. We will assume h = 1.2. All these cost functions are defined for a
single join operator. The cost of a join tree is defined as the sum of the costs of
all joins it contains. We use the symbols Cx to also denote the costs of not only
a single join but the costs of the whole tree. Hence, for sequences s of relations,
we have
n
X
Cnlj (s) = |s1 , . . . , si−1 | ∗ |si |
i=2
Xn
Chj (s) = 1.2|s1 , . . . , si−1 |
i=2
Xn n
X
Csmj (s) = |s1 , . . . , si−1 | log(|s1 , . . . , si−1 |) + |si | log(|si |)
i=2 i=2

Some notes on the cost functions are in order. First, note that these cost
functions are even for main memory a little incomplete. For example, constant
factors are missing. Second, the cost functions are mainly devised for left-deep
trees. This becomes apparent when looking at the costs of hash joins. It is
assumed that the right input is already stored in an appropriate hash table.
Obviously, this can only hold for base relations, giving rise to left-deep trees.
Third, Chj and Csmj do not work for cross products. However, we can extend
these cost functions by defining the cost of a cross product to be equal to
its output cardinality, which happens to be the cost of Cnlj . Fourth, in reality,
more complex cost functions are used and other parameters like the width of the
tuples—i.e. the number of bytes needed to store them—also play an important
role. Fifth, the above cost functions assume that the same join algorithm is
chosen throughout the whole plan. In practice, this will not be true.
For the above chain query, we compute the costs of different join trees. The
last join tree contains a cross product.
Cout Cnlj Chj Csmj
R1 B R2 100 1000 12 697.61
R2 B R3 20000 100000 120 10630.26
R1 A R3 10000 10000 10000 10000.00
(R1 B R2 ) B R3 20100 101000 132 11327.86
(R2 B R3 ) B R1 40000 300000 24120 32595.00
(R1 A R3 ) B R2 30000 1010000 22000 143542.00
3.1. QUERIES CONSIDERED 37

For the calculation of Cout note that |R1 B R2 B R3 | = 20000 is included in all
of the last three lines of its column. For the nested loop cost function, the costs
are calculated as follows:

Cnlj ((R1 B R2 ) B R3 ) = 1000 + 100 ∗ 1000 = 101000

Cnlj ((R2 B R3 ) B R1 ) = 100000 + 20000 ∗ 10 = 300000
Cnlj ((R1 A R3 ) B R2 ) = 10000 + 10000 ∗ 100 = 1010000

The reader should verify the other costs.

Several observations can be made from the above numbers:
• The costs of different join trees differ vastly under every cost function.
Hence, it is worth spending some time to find a cheap join order.

• The costs of the same join tree differ under the different cost functions.

• The cheapest join tree is (R1 B R2 ) B R3 under all four cost functions.

• Join trees with cross products are expensive.

Thus, a heuristics often used is not to consider join trees that contain
unnecessary cross products. (If the query graph consists of several un-
connected components, then and only then cross products are necessary.
In other words: if the query graph is connected, no cross products are
necessary.).

• The join order matters even for join trees without cross products.
We would like to emphasize that the join order is also relevant under other cost
functions.
Avoiding cross products is not always beneficial, as the following query
specifiation shows:

|R1 | = 1000
|R2 | = 2
|R3 | = 2
f1,2 = 0.1
f1,3 = 0.1

For Cout we have costs

Join Tree Cout

R1 B R2 200
R2 A R3 4
R1 B R3 200
(R1 B R2 ) 1 R3 240
(R2 A R3 ) 1 R1 44
(R1 B R3 ) 1 R2 240
38 CHAPTER 3. JOIN ORDERING

Note that although the absolute numbers are quite small, the ratio of the best
and the second best join tree is quite large. The reader is advised to find more
examples and to apply other cost functions.
The following example illustrates that a bushy tree can be superior to any
linear tree. Let us use the following query specification:

|R1 | = 10
|R2 | = 20
|R3 | = 20
|R4 | = 10
f1,2 = 0.01
f2,3 = 0.5
f3,4 = 0.01

If we do not consider cross products, we have for the symmetric (see below)
cost function Cout the following join trees and costs:
Join Tree Cout
R1 B R2 2
R2 B R3 200
R3 B R4 2
((R1 B R2 ) B R3 ) B R4 24
((R2 B R3 ) B R1 ) B R4 222
(R1 B R2 ) B (R3 B R4 ) 6
Note that all other linear join trees fall into one of these classes, due to the
symmetry of the cost function and the join ordering problem. Again, the reader
is advised to find more examples and to apply other cost functions.
If we want to annotate a join operator by its implementation—which is
necessary for the correct computation of costs—we write Bimpl for an imple-
mentation impl. For example, Bsmj is a sort-merge join, and the according cost
function Csmj is used to compute its costs.
Two properties of cost functions have some impact on the join ordering
problem. The first is symmetry. A cost function Cimpl is called symmetric
if Cimpl (R1 Bimpl R2 ) = Cimpl (R2 Bimpl R1 ) for all relations R1 and R2 . For
symmetric cost functions, it does not make sense to consider commutativity.
Hence, it suffices to consider left-deep trees only if we want to restrict ourselves
to linear join trees. Note that Cout , Cnlj , Csmj are symmetric while Chj is not.
The other property is the adjacent sequence interchange (ASI) property.
Informally, the ASI property states that there exists a rank function such that
the order of two subsequences is optimal if they are ordered according to the
rank function. The ASI property is formally defined in Section 3.2.2. Only for
tree queries and cost functions with the ASI property, a polynomial algorithm
to find an optimal join order is known. Our cost functions Cout and Chj have the
ASI property, Csmj does not. Summarizing the properties of our cost functions,
we see that the classification is orthogonal:
3.1. QUERIES CONSIDERED 39

ASI ¬ ASI
symmetric Cout , Cnlj Csmj
¬ symmetric Chj (see text)

For the missing case of a non-symmetric cost function not having the ASI
property, we can use the cost function of the hybrid hash join [234, 664].
We turn to another not really well-researched topic. The goal is to cut
down the number of cost functions which have to be considered for optimization
and to possibly allow for simpler cost functions, which saves time during plan
generation. Unfortunately, we have to restrict ourselves to left-deep join trees.
Let s denote a sequence or permutation of a given set of joins. We define an
equivalence relation on cost functions.

Definition 3.1.1 Let C and C 0 be two cost functions. Then

C ≡ C 0 :≺ (∀s C(s) minimal ≺ C 0 (s) minimal)

Here, s is a join sequence.

Obviously, ≡ is an equivalence relation.

Now we can define the ΣIR property.

Definition 3.1.2 A cost function C is ΣIR :≺ C ≡ Cout .

That is, ΣIR is the set of all cost functions that are equivalent to Cout .
Let us consider a very simple example. The last element of the sum in Cout
is the size of the final join (all relations are joined). This is not the case for the
following cost function:
n−1
X
0
Cout (s) := |s1 , . . . , si |
i=2

0
Obviously, we have Cout is ΣIR. The next observation shows that we can
construct quite complex ΣIR cost functions:

Observation 3.1.3 Let C1 and C2 be two ΣIR cost functions. For non-
decreasing functions f1 : R → R and f2 : R × R → R and constants c ∈ R
and d ∈ R+ , we have that EX

C1 + c
C1 ∗ d
f1 ◦ C1
f2 ◦ (C1 , C2 )

are ΣIR. Here, ◦ denotes function composition and (·, ·) function pairing.

There are of course many more possibilites of constructing ΣIR functions. For
the cost functions Chj , Csmj , and Cnlj , we now investigate which of them have
the ΣIR property.
40 CHAPTER 3. JOIN ORDERING

Let us consider Chj first. From

n
X
Chj (s) = 1.2|s1 , . . . , si−1 |
i=2
n−1
X
= 1.2|s1 | + 1.2 |s1 , . . . , si |
i=2
0
= 1.2|s1 | + 1.2Cout (s)

and observation 3.1.3, we conclude that Chj is ΣIR for a fixed relation to be
joined first. If we can optimize Cout in polynomial time, then we can optimize
Cout for a fixed starting relation. Indeed, by trying each relation as a starting
EX relation, we can find the optimal join tree in polynomial time. An algorithm
that computes the optimal solution for an arbitrary relation to be joined first
can be found in Section 3.2.2.
Now, consider Csmj . Since
n
X
|s1 , . . . , si−1 |log(|s1 , . . . , si−1 |)
i=2

is minimal if and only if

n
X
|s1 , . . . , si−1 |
i=2
Pn
is minimal and i=2 |si | log(|si |) is independent of the order of the relations
within s — that is constant — we conclude that Csmj is ΣIR.
Last, we have that Cnlj is not ΣIR. To see this, consider the following
counter example with three relations R1 , R2 , and R3 of sizes 10, 10, and 100,
9 1 1
resp. The selectivities are f1,2 = 10 , f2,3 = 10 , and f1,3 = 10 . Now,

|R1 R2 | = 90
|R1 R3 | = 100
|R2 R3 | = 100

and

Cnl (R1 R2 R3 ) = 10 ∗ 10 + 90 ∗ 100 = 9100

Cnl (R1 R3 R2 ) = 10 ∗ 100 + 100 ∗ 10 = 2000
Cnl (R2 R3 R1 ) = 10 ∗ 100 + 100 ∗ 10 = 2000

We see that R1 R2 R3 has the smallest sum of intermediate result sizes but
produces the highest cost. Hence, Cnlj is not ΣIR.

3.1.4 Classification of Join Ordering Problems

After having discussed the different classes of query graphs, join trees and
cost functions, we can classify join ordering problems. To define a certain join
ordering problem, we have to pick one entry from every class:
3.1. QUERIES CONSIDERED 41

Query Graph Classes × Possible Join Tree Classes × Cost Function

Classes

The query graph classes considered are chain, star , tree, and cyclic. For the join
tree classes we distinguish between the different join tree shapes, i.e. whether
they are left-deep, zig-zag, or bushy trees. We left out the right-deep trees, since
they do not differ in their behavior from left-deep trees. We also have to take
into account whether cross products are considered or not. For cost functions,
we use a simple classification: we only distinguish between those that have the
ASI property and those that do not. This leaves us with 4∗3∗2∗2 = 48 different
join ordering problems. For these, we will first review search space sizes and
complexity. Then, we discuss several algorithms for join ordering. Last, we give
some insight into cost distributions over the search space and how this might
influence the benchmarking of different join ordering algorithms.

3.1.5 Search Space Sizes

Since search space sizes are easier to count if cross products are allowed, we
consider them first. Then we turn to search spaces where cross products are
not considered.

Join Trees with Cross Products We consider the number of join trees for
a query graph with n relations. When cross products are allowed, the number
of left-deep and right-deep join trees is n!. By allowing cross products, the
query graph does not restrict the search space in any way. Hence, any of the n!
permutations of the n relations corresponds to a valid left-deep join tree. This
is true independent of the query graph.
Similarly, the number of zig-zag trees can be estimated independently of
the query graph. First note that for joining n relations, we need n − 1 join
operators. From any left-deep tree, we derive zig-zag trees by using the join’s
commutativity and exchange the left and right inputs. Hence, from any left-
deep tree for n relations, we can derive 2n−2 zig-zag trees. We have to subtract
another one, since the bottommost joins’ arguments are exchanged in different
left-deep trees. Thus, there exists a total of 2n−2 n! zig-zag trees. Again, this
number is independent of the query graph.
The number of bushy trees can be estimated as follows. First, we need the
number of binary trees. For n leaf nodes, the number of binary trees is given
by C(n − 1), where C(n) is defined by the recurrence

n−1
X
C(n) = C(k)C(n − k − 1)
k=0

with C(0) = 1. The numbers C(n) are called the Catalan Numbers (see [205]).
They can also be computed by the following formula:

1 2n
C(n) = .
n+1 n
42 CHAPTER 3. JOIN ORDERING

The Catalan Numbers grow in the order of Θ(4n /n3/2 ).

After we know the number of binary trees with n leaves, we now have to
attach the n relations to the leaves in all possible ways. For a given binary
tree, this can be done in n! ways. Hence, the total number of bushy trees is
n!C(n − 1). This can be simplified as follows (see also [303, 524, 854]):

1 2(n − 1)
n!C(n − 1) = n!
n n−1
1 (2n − 2)!
= n!
n (n − 1)!((2n − 2) − (n − 1))!
(2n − 2)!
=
(n − 1)!

Chain Queries, Left-Deep Join Trees, No Cartesian Product We now

derive the function that calculates the number of left-deep join trees with no
cross products for a chain query of n relations. That is, the query graph is
R1 – R2 – . . . – Rn−1 – Rn . Let us denote the number of join trees by f (n).
Obviously, for n = 0 there is only one (the empty) join tree. For n = 1, there
is also only one join tree (no join). For larger n: Consider the join trees for R1
– . . . – Rn−1 where relation Rn−1 is the k-th relation from the bottom where k
ranges from 1 to n − 1. From such a join tree we can derive join trees for all n
relations by adding relation Rn at any position following Rn−1 . There are n − k
such join trees. Only for k = 1, we can also add Rn below Rn−1 . Hence, for
k = 1 we have n join trees. How many join trees with Rn−1 at position k are
there? For k = 1, Rn−1 must be the first relation to be joined. Since we do not
consider cross products, it must be joined with Rn−2 . The next relation must
be Rn−3 , and so on. Hence, there is only one such join tree. For k = 2, the first
relation must be Rn−2 , which is then joined with Rn−1 . Then Rn−3 , . . . , R1
must follow in this order. Again, there is only one such join tree. For higher k,
for Rn−1 to occur safely at position k (no cross products) the k − 1 relations
Rn−2 , . . . , Rn−k must occur before Rn−1 . There are exactly f(k − 1) join trees
for the k − 1 relations. On each such join tree we just have to add Rn−1 on top
of it to yield a join tree with Rn−1 at position k.
Pn−1
Now we can compute the f(n) as n + k=2 f (k − 1) ∗ (n − k) for n > 1.
Solving this recurrence gives us f (n) = 2 n−1 . The proof is by induction. The
case n = 1 is trivial.
3.1. QUERIES CONSIDERED 43

The induction step for n > 1 provided by Thomas Neumann goes as follows:

n−1
X
f (n) = n + f (k − 1) ∗ (n − k)
k=2
n−3
X
= n+ f (k + 1) ∗ (n − k − 2)
k=0
n−3
X
= n+ 2k ∗ (n − k − 2)
k=0
n−2
X
= n+ k2n−k−2
k=1
n−2
X n−2
X
n−k−2
= n+ 2 + (k − 1)2n−k−2
k=1 k=2
n−2
X n−2
X
= n+ 2n−j−2
i=1 j=i
n−2
X n−i−2
X
= n+ 2j
i=1 j=0
n−2
X
= n+ (2n−i−1 − 1)
i=1
n−2
X
= n+ 2i − (n − 2)
i=1
= n + (2n−1 − 2) − (n − 2)
= 2n−1

Chain Queries, Zig-Zag Join Trees, No Cartesian Product All possible

zig-zag trees can be derived from a left-deep tree by exchanging the left and
right arguments of a subset of the joins. Since for the first join these alternatives
are already considered within the set of left-deep trees, we are left with n − 2
joins. Hence, the number of zig-zag trees for n relations in a chain query is
2n−2 ∗ 2n−1 = 22n−3 .

Chain Queries, Bushy Join Trees, No Cartesian Product We can com-

pute the number of bushy trees with no cross products for a chain query in the
following way. Let us denote this number by f(n). Again, let us assume that
the chain query has the form R1 – R2 – . . . – Rn−1 – Rn . For n = 0, we only
have the empty join tree. For n = 1 we have one join tree. For n = 2 we have
two join trees. For more relations, every subtree of the join tree must contain
a subchain in order to avoid cross products. Further, the subchain can occur
44 CHAPTER 3. JOIN ORDERING

as the left or right argument of the join. Hence, we can compute f(n) as

n−1
X
2 f(k) f(n − k)
k=1

This is equal to
2n−1 C(n − 1)

EX where C(n) are the Catalan Numbers.

Star Queries, No Cartesian Product The first join has to connect the
center relation R0 with any of the other relations. The other relations can
follow in any order. Since R0 can be the left or the right input of the first
join, there are 2 ∗ (n − 1)! possible left-deep join trees for Star Queries with no
Cartesian Product.
The number of zig-zag join trees is derived by exchanging the arguments
of all but the first join in any left-deep join tree. We cannot consider the first
join since we did so in counting left-deep join trees. Hence, the total number of
zig-zag join trees is 2 ∗ (n − 1)! ∗ 2n−2 = 2n−1 ∗ (n − 1)!.
Constructing bushy join trees with no Cartesian Product from a Star Query
other than zig-zag join trees is not possible.

Remarks The numbers for star queries are also upper bounds for tree queries.
For clique queries, no join tree containing a cross product is possible. Hence,
all join trees are valid join trees and the search space size is the same as the
corresponding search space for join trees with cross products.
To give the reader a feeling for the numbers, the following tables contain
the potential search space sizes for some n.

Join trees without cross products

chain query star query
left-deep zig-zag bushy left-deep zig-zag/bushy
n 2n−1 22n−3 2n−1 C(n − 1) 2 ∗ (n − 1)! 2n−1 (n − 1)!
1 1 1 1 1 1
2 2 2 2 2 2
3 4 8 8 4 8
4 8 32 40 12 48
5 16 128 224 48 384
6 32 512 1344 240 3840
7 64 2048 8448 1440 46080
8 128 8192 54912 10080 645120
9 256 32768 366080 80640 10321920
10 512 131072 2489344 725760 185794560
3.1. QUERIES CONSIDERED 45

With cross products/clique

left-deep zig-zag bushy
n n! 2n−2 ∗ n! n!C(n − 1)
1 1 1 1
2 2 2 2
3 6 12 12
4 24 96 120
5 120 960 1680
6 720 11520 30240
7 5040 161280 665280
8 40320 2580480 17297280
9 362880 46448640 518918400
10 3628800 928972800 17643225600

Note that in Figure 2.6 only 32 join trees are listed, whereas the number of
bushy trees for chain queries with four relations is 40. The missing eight cases
are those zig-zag trees which are symmetric (i.e. derived by applying commu-
tativity to all occurring joins) to the ones contained in the second column.
From these numbers, it becomes immediately clear why historically the
search space of query optimizers was restricted to left-deep trees and cross
products for connected query graphs were not considered.

3.1.6 Problem Complexity

The complexity of the join ordering problem depends on several parameters.
These are the shape of the query graph, the class of join trees to be considered,
whether cross products are considered or not, and whether the cost function
has the ASI property or not. Not for all the combinations complexity results
are known. What is known is summarized in the following table.

Query graph Join tree Cross products Cost function Complexity

general left-deep no ASI NP-hard
tree/star/chain left-deep no one join method (ASI) P
star left-deep no two join methods (NLJ+SMJ) NP-hard
general/tree/star left-deep yes ASI NP-hard
chain left-deep yes — open
general bushy no ASI NP-hard
tree bushy no — open
star bushy no ASI P
chain bushy no any P
general bushy yes ASI NP-hard
tree/star/chain bushy yes ASI NP-hard
Ibaraki and Kameda were the first who showed that the problem of deriving
optimal left-deep trees for cyclic queries is NP-hard for a cost function for an
n-way nested loop join implementation [432]. The proof was repeated for the
cost function Cout which has the ASI property [190, 864]. In both proofs, the
46 CHAPTER 3. JOIN ORDERING

clique problem was used for the reduction [316]. Cout was also used in the other
proofs of NP-hardness results. The next line goes back to the same paper.
Ibaraki and Kameda also described an algorithm to solve the join ordering
problem for tree queries producing optimal left-deep trees for a special cost
function for a nested-loop n-way join algorithm. Their algorithm was based on
the observation that their cost function has the ASI property. For this case,
they could derive an algorithm from an algorithm for a sequencing problem for
job scheduling designed by Monma and Sidney [616]. They, in turn, used an
earlier result by Lawler [529]. The algorithm of Ibaraki and Kameda was slightly
generalized by Krishnamurthy, Boral, and Zaniolo, who were also able to sketch
a more efficient algorithm. It improves the time bounds from O(n2 log n) to
O(n2 ). The disadvantage of both approaches is that with every relation, a fixed
(i.e. join-tree independent) join implementation must be associated before the
optimization starts. Hence, it only produces optimal trees if there is only one
join implementation available or one is able to guess the optimal join method
before hand. This might not be the case. The polynomial algorithm which we
term IKKBZ is described in Section 3.2.2.
For star queries, Ganguly investigated the problem of generating optimal
left-deep trees if no cross products but two different cost functions (one for
nested loop join, the other for sort merge join) are allowed. It turned out that
this problem is NP-hard [308].
The next line is due to Cluet and Moerkotte [190]. They showed by reduc-
tion from 3DM that taking into account cross products results in an NP-hard
problem even for star queries. Remember that star queries are tree queries and
general graphs.
The problem for general bushy trees follows from a result by Scheufele and
Moerkotte [756]. They showed that building optimal bushy trees for cross
products only (i.e. all selectivities equal one) is already NP-hard. This result
also explains the last two lines.
By noting that for star queries, all bushy trees that do not contain a cross
product are left-deep trees, the problem can be solved by the IKKBZ algorithm
for left-deep trees. Ono and Lohman showed that for chain queries dynamic
programming considers only a polynomial number of bushy trees if no cross
products are considered [640]. This is discussed in Section 3.2.4.
The table is rather incomplete. Many open problems exist. For example,
if we have chain queries and consider cross products: is the problem NP-hard
or in P? Some results for this problem have been presented [756], but it is
still an open problem (see Section 3.2.7). Open is also the case where we
produce optimal bushy trees with no cross products for tree queries. Yet another
example of an open problem is whether we could drop the ASI property and are
still able to derive a polynomial algorithm for a tree query. This is especially
important, since the cost function for a sort-merge algorithm does not have the
ASI property.
Good summaries of complexity results for different join ordering problems
can be found in the theses of Scheufele [754] and Hamalainen [388].
Given that join ordering is an inherently complex problem with no polyno-
mial algorithm in sight, one might wonder whether there exists good polynomial
3.2. DETERMINISTIC ALGORITHMS 47

approximation algorithms. Chances are that even this is not the case. Chatter-
ji, Evani, Ganguly, and Yemmanuru showed that three different optimization
problems — all asking for linear join trees — are not approximable [142].

3.2 Deterministic Algorithms

3.2.1 Heuristics
We now present some simple heuristic solutions to the problem of join ordering.
These heuristics only produce left-deep trees. Since left-deep trees are equiv-
alent with permutations, these heuristics order the joins according to some
criterion.
The core algorithm for the heuristics discussed here is the greedy algorithm
(for an introduction see [205]). In greedy algorithms, a weight is associated with
each entity. In our case, weights are associated with each relation. A typical
weight function is the cardinality of the relation (|R|). Given a weight function
weight, a greedy join ordering algorithm works as follows:

GreedyJoinOrdering-1({R1 , . . . , Rn }, (*weight)(Relation))
Input: a set of relations to be joined and a weight function
Output: a join order
S = ; // initialize S to the empty sequence
R = {R1 , . . . , Rn }; // let R be the set of all relations
while(!empty(R)) {
Let k be such that: weight(Rk ) = minRi ∈R (weight(Ri ));
R\ = Rk ; // eliminate Rk from R
S◦ = Rk ; // append Rk to S
}
return S
This algorithm takes cross products into account. If we are only interested
in left-deep join trees with no cross products, we have to require that Rk is
connected to some of the relations contained in S in case S 6= . Note that a
more efficient implementation would sort the relations according to their weight.
Not all heuristics can be implemented with a greedy algorithm as simple as
above. An often-used heuristics is to take the relation next that produces the
smallest (next) intermediate result. This cannot be determined by the relation
alone. One must take into account the sequence S already processed, since on-
ly then the selectivities of all predicates connecting relations in S and the new
relation are deducible. And we must take the product of these selectivities and
the cardinality of the new relation in order to get an estimate of the intermedi-
ate result’s cardinality. As a consequence, the weights become relative to S. In
other words, the weight function now has two parameters: a sequence of rela-
tions already joined and the relation whose relative weight is to be computed.
Here is the next algorithm:
48 CHAPTER 3. JOIN ORDERING

GreedyJoinOrdering-2({R1 , . . . , Rn },
(*weight)(Sequence of Relations, Relation))
Input: a set of relations to be joined and a weight function
Output: a join order
S = ; // initialize S to the empty sequence
R = {R1 , . . . , Rn }; // let R be the set of all relations
while(!empty(R)) {
Let k be such that: weight(S, Rk ) = minRi ∈R (weight(S, Ri ));
R\ = Rk ; // eliminate Rk from R
S◦ = Rk ; // append Rk to S
}
return S

Note that for this algorithm, sorting is not possible. GreedyJoinOrdering-2

can be improved by taking every relation as the starting one.

GreedyJoinOrdering-3({R1 , . . . , Rn }, (*weight)(Sequence of Relations, Relation))

Input: a set of relations to be joined and a weight function
Output: a join order
Solutions = ∅;
for (i = 1; i ≤ n; + + i) {
S = Ri ; // initialize S to a singleton sequence
R = {R1 , . . . , Rn } \ {Ri }; // let R be the set of all relations
while(!empty(R)) {
Let k be such that: weight(S, Rk ) = minRi ∈R (weight(S, Ri ));
R\ = Rk ; // eliminate Rk from R
S◦ = Rk ; // append Rk to S
}
Solutions += S;
}
return cheapest in Solutions

In addition to the relative weight function mentioned before, another of-

ten used relative weight function is the product of the selectivities connecting
relations in S with the new relation. This heuristics is sometimes called MinSel .
The above two algorithms generate linear join trees. Fegaras proposed a
heuristic (named Greedy Operator Ordering (GOO)) to generate bushy join
trees [272, 273]. The idea is as follows. A set of join trees Trees is initialized
such that it contains all the relations to be joined. It then investigates all pairs
of trees contained in Trees. Among all of these, the algorithm joins the two
trees that result in the smallest intermediate result when joined. The two trees
are then eliminated from Trees and the new join tree joining them is added to
it. The algorithm then looks as follows:
3.2. DETERMINISTIC ALGORITHMS 49

GOO({R1 , . . . , Rn })
Input: a set of relations to be joined
Output: join tree
Trees := {R1 , . . . , Rn }
while (|Trees| != 1) {
find Ti , Tj ∈ Trees such that i 6= j, |Ti B Tj | is minimal
among all pairs of trees in Trees
Trees − = Ti ;
Trees − = Tj ;
Trees + = Ti B Tj ;
}
return the tree contained in Trees;
Our GOO variant differs slightly from the one proposed by Fegaras. He uses
arrays, explicitly handles the forming of the join predicates, and materializes
intermediate result sizes. Hence, his algorithm is a little more elaborated, but
we assume that the reader is able to fill in the gaps.
None of our algorithms so far considers different join implementations. An
explicit consideration of commutativity for non-symmetric cost functions could
also help to produce better join trees. The reader is asked to work out the details
of these extensions. In general, the heuristics do not produce the optimal plan. EX
The reader is advised to find examples where the heuristics are far off the best
possible plan. EX

3.2.2 Determining the Optimal Join Order in Polynomial Time

Since the general problem of join ordering is NP-hard, we cannot expect to
find a polynomial solution for it. However, for special cases, we can expect to
find solutions that work in polynomial time. These solutions can also be used
as heuristics for the general case, either to find a not-that-bad join tree or to
determine an upper bound for the costs that is then fed into a search procedure
to prune the search space.
The most general case for which a polynomial solution is known is charac-
tized by the following features:
• the query graph must be acyclic
• no cross products are considered
• the search space is restricted to left-deep trees
• the cost function must have the ASI property
The algorithm was presented by Ibaraki and Kameda [432]. Later Krishna-
murthy, Boral, and Zaniolo presented it again for some other cost functions
(still having the ASI property) [512]. They also observed that the upper bound
O(n2 log n) of the original algorithm could be improved to O(n2 ). In any case,
the algorithm is based on an algorithm discovered by Monma and Sidney for
job scheduling [529, 616] . Let us call the (unimproved) algorithm IKKBZ-
Algorithm.
50 CHAPTER 3. JOIN ORDERING

The IKKBZ-Algorithm considers only join operations that have a cost func-
tion of the form:
cost(Ri 1 Rj ) = |Ri | ∗ hj (|Rj |)
where each Rj can have its own cost function hj . We denote the set of hj by
H and parameterize cost functions with it. Example instanciations are
• hj ≡ 1.2 for main memory hash-based joins
• hj ≡ id for nested-loop joins
where id is the identity function. Let us denote by ni the cardinality of the
relation Ri (ni := |Ri |). Then, the hi (ni ) represent the costs per input tuple to
be joined with Ri .
The algorithm works as follows. For every relation Rk it computes the
optimal join order under the assumption that Rk is the first relation in the join
sequence. The resulting subproblems then resemble a job-scheduling problem
that can be solved by the Monma-Sidney-Algorithm [616].
In order to present this algorithm, we need the notion of a precedence graph.
A precedence graph is formed by taking a node in the (undirected) query graph
and making this node a root node of a (directed) tree where the edges point
away from the selected root node. Hence, for acyclic, connected query graphs—
those we consider in this section—a precedence graph is a tree. We construct
the precedence graph of a query graph G = (V, E) as follows:
• Make some relation Rk ∈ V the root node of the precedence graph.
• As long as not all relations are included in the precedence graph: Choose
a relation Ri ∈ V , such that (Rj , Ri ) ∈ E is an edge in the query graph
and Rj is already contained in the (partial) precedence graph constructed
so far and Ri is not. Add Rj and the edge Rj → Ri to the precedence
graph.
A sequence S = v1 , . . . , vk of nodes conforms to a precedence graph G = (V, E)
if the following conditions are satisfied:
1. for all i (2 ≤ i ≤ k) there exists a j (1 ≤ j < i) with (vj , vi ) ∈ E and
2. there is no edge (vi , vj ) ∈ E for i > j.
For non-empty sequences U and V in a precedence graph, we write U → V if,
according to the precedence graph, U must occur before V . This requires U
and V to be disjoint. More precisely, there can only be paths from nodes in U
to nodes in V and at least one such path exists.
Consider the following query graph:
R1 R5

R3 R4

R2 R6
3.2. DETERMINISTIC ALGORITHMS 51

For this query graph, we can derive the following precedence graphs:

R1 R2 R3

R3 R3 R1 R2 R4

R2 R4 R1 R4 R5 R6

R5 R6 R5 R6

R4 R5 R6

R3 R5 R6 R4 R4

R3 R5 R6 R3 R5 R3

R1 R2 R1 R2

The IKKBZ-Algorithm takes a single precedence graph and produces a new

one that is totally ordered. From this total order it is very easy to construct a
corresponding join graph: the following figure contains a totally ordered prece-
dence graph (left-hand side) as generated by the IKKBZ-Algorithm and the
corresponding join graph on the right-hand side.

R1 B

R2 B R6

R3 B R5

R4 B R4

R5 B R3

R6 R1 R2

Define

R1,2,...,k := R1 1 R2 1 · · · 1 Rk
n1,2,...,k := |R1,2,...,k |

For a given precedence graph, let Ri be a relation and Ri be the set of relations
from which there exists a path to Ri . Then, in any join tree adhering to the
52 CHAPTER 3. JOIN ORDERING

Q in Ri and only those will be joined before Ri .

precedence graph, all relations
Hence, we can define si = Rj ∈Ri fi,j for i > 1. Note that for any i only one j
with fi,j 6= 1 exists in the product. If the precedence graph is a chain, then the
following holds:

n1,2,...,k+1 = n1,2...,k ∗ sk+1 ∗ nk+1

We define s1 = 1. Then we have

n1,2 = s2 ∗ (n1 ∗ n2 ) = (s1 ∗ s2 ) ∗ (n1 ∗ n2 )

and, in general,
k
Y
n1,2,...,k = (si ∗ ni ).
i=1

We call the si selectivities, although they depend on the precedence graph.

The costs for a totally ordered precedence graph G can thus be computed
as follows:
n
X
CostH (G) = [n1,2,...,i−1 ∗ hi (ni )]
i=2
Xn i−1
Y
= [( sj ∗ nj ) ∗ hi (ni )]
i=2 j=1

If we define hi (ni ) = si ni , then CostH ≡ Cout . The factor si ni determines by

how much the input relation to be joined with Ri changes its cardinality after
the join has been performed. If si ni is less than one, we call the join decreasing,
if it is larger than one, we call the join increasing. This distinction plays an
important role in the heuristic discussed in Section 3.2.3.
The cost function can also be defined recursively.

Definition 3.2.1 Define the cost function CH as follows:

CH () = 0
CH (Rj ) = 0 if Rj is the root
CH (Rj ) = hj (nj ) else
CH (S1 S2 ) = CH (S1 ) + T (S1 ) ∗ CH (S2 )

where

T () = 1
Y
T (S) = (si ∗ ni )
Ri ∈S

It is easy to prove by induction that CH is well-defined and that CH (G) =

EX CostH (G).
3.2. DETERMINISTIC ALGORITHMS 53

Definition 3.2.2 Let A and B be two sequences and V and U two non-empty
sequences. We say that a cost function C has the adjacent sequence interchange
property (ASI property) if and only if there exists a function T and a rank
function defined for sequences S as

T (S) − 1
rank(S) =
C(S)

such that for non-empty sequences S = AU V B the following holds

C(AU V B) ≤ C(AV U B) ≺ rank(U ) ≤ rank(V ) (3.1)

if AU V B and AV U B satisfy the precedence constraints imposed by a given

precedence graph.

Lemma 3.2.3 The cost function CH defined in Definition 3.2.1 has the ASI
property.

The proof is very simple. Using the definition of CH , we have

CH (AU V B) = CH (A)
+T (A)CH (U )
+T (A)T (U )CH (V )
+T (A)T (U )T (V )CH (B)

and, hence,

CH (AU V B) − CH (AV U B) = T (A)[CH (V )(T (U ) − 1) − CH (U )(T (V ) − 1)]

= T (A)CH (U )CH (V )[rank(U ) − rank(V )]

The proposition follows. 2

Definition 3.2.4 Let M = {A1 , . . . , An } be a set of node sequences in a given

precedence graph. Then, M is a called a module if for all sequences B that do
not overlap with the sequences in M one of the following conditions holds:

• B → Ai , ∀ 1 ≤ i ≤ n

• Ai → B, ∀ 1 ≤ i ≤ n

• B 6→ Ai and Ai 6→ B, ∀ 1 ≤ i ≤ n

Lemma 3.2.5 Let C be any cost function with the ASI property and {A, B}
a module. If A → B and additionally rank(B) ≤ rank(A), then we find an
optimal sequence among those in which B directly follows A.
54 CHAPTER 3. JOIN ORDERING

Proof Every optimal permutation must have the form (U, A, V, B, W ), since
A → B. Assumption: V 6= . If rank(V ) ≤ rank(A), then we can ex-
change V and A without increasing the costs. If rank(A) ≤ rank(V ), we
have rank(B) ≤ rank(V ) due to the transitivity of ≤. Hence, we can exchange
B and V without increasing the costs. Both exchanges produce legal sequences
obeying the precedence graph, since {A, B} is a module. 2
If the precedence graph demands A → B but rank(B) ≤ rank(A), we speak
of contradictory sequences A and B. Since the lemma shows that no non-empty
subsequence can occur between A and B, we will combine A and B into a new
single node replacing A and B. This node represents a compound relation
comprising all relations in A and B. Its cardinality is computed by multiplying
the cardinalities of all relations occurring in A and B, and its selectivity s is
the product of all the selectivities si of the relations Ri contained in A and B.
The continued process of this step until no more contradictory sequence exists
is called normalization. The opposite step, replacing a compound node by the
sequence of relations it was derived from, is called denormalization.
We can now present the algorithm IKKBZ.

IKKBZ(G)
Input: an acyclic query graph G for relations R1 , . . . , Rn
Output: the best left-deep tree
R = ∅;
for (i = 1; i ≤ n; + + i) {
Let Gi be the precedence graph derived from G and rooted at Ri ;
T = IKKBZ-Sub(Gi );
R+ = T ;
}
return best of R;

IKKBZ-Sub(Gi )
Input: a precedence graph Gi for relations R1 , . . . , Rn rooted at some Ri
Output: the optimal left-deep tree under Gi
while (Gi is not a chain) {
let r be the root of a subtree in Gi whose subtrees are chains;
IKKBZ-Normalize(r);
merge the chains under r according to the rank function
in ascending order;
}
IKKBZ-Denormalize(Gi );
return Gi ;

IKKBZ-Normalize(r)
Input: the root r of a subtree T of a precedence graph G = (V, E)
Output: a normalized subchain
while (∃ r0 , c ∈ V , r →∗ r0 , (r0 , c) ∈ E: rank(r0 ) > rank(c)) {
replace r0 by a compound relation r00 that represents r0 c;
3.2. DETERMINISTIC ALGORITHMS 55

};

We do not give the details of IKKBZ-Denormalize, as it is trivial.

R1
100 18
R2 1 1 R5 19
2 3 R2 R3 R4 20
10 100 49 24
50 25
R1 R4
1
3
5
1
R6,7 5
1
R3 4 2 R6 R7
1
100 A) 10 10 20 D)
5
R5 6
R1
R1
19
R2 R3 R4 20
49 24
50 25 R2 R3 R4,6,7 199
320
4 49 24
R5 R6 5 50 25
5
6 5
E) R5 6
B)
1
R7 2
R1

19
R2 R3 R4 20
49 24
50 25

R5 R6,7 3 F)
C) 5
5
6

Figure 3.3: Illustrations for the IKKBZ Algorithm

Let us illustrate the algorithm IKKBZ-Sub by a simple example. We use

the cost function Cout . Figure 3.3 A) shows a query graph. The relations are
annotated with their sizes and the edges with the join selectivities. Chosing R1
as the root of the precedence graph results in B). There, the nodes are annotated
by the ranks of the relations. R4 is the root of a subtree all of whose subtrees
are chains. Hence, we normalize it. For R5 , there is nothing to do. The ranks
of R6 and R7 are contradictory. We form a compound relation R6,7 , calculate
its cardinality, selectivity, and rank. The latter is shown in C). Merging the
two subchains under R4 results in D). Now R1 is the root of a subtree with
only chains underneath. Normalization detects that the ranks for R4 and R5
are contradictory. E) shows the tree after introducing the compound relation
R4,5 . Now R4,5 and R6,7 have contradictory ranks, and we replace them by the
compound relation R4,5,6,7 as shown in F). Merging the chains under R1 gives
G). Since this is a chain, we leave the loop and denormalize. The final result is
shown in H).
56 CHAPTER 3. JOIN ORDERING

I R1 R2 R3 R4 II p2,3

p1,2 p3,4

III a) p3,4 III b) B

B R4
p2,3
B R3

R1 R2
p1,2

IV a) p2,3 IV b) B

B B
p3,4
R1 R2 R3 R4

p1,2

V a) p2,3 V b) B

B B
p1,2 p3,4
R1 R2 R3 R4

Figure 3.4: A query graph, its directed join graph, some spanning trees and
join trees

We can use the IKKBZ-Algorithm to derive a heuristics also for cyclic

queries, i.e. for general query graphs. In a first step, we determine a mini-
mal spanning tree of the query graph. It is then used as the input query graph
for the IKKBZ-Algorithm. Let us call this the IKKBZ-based Heuristics.

3.2.3 The Maximum-Value-Precedence Algorithm

Lee, Shih, and Chen proposed a very interesting heuristics for the join ordering
problem [530]. They use a weighted directed join graph (WDJG) to represent
queries. Within this graph, every join tree corresponds to a spanning tree.
3.2. DETERMINISTIC ALGORITHMS 57

Given a conjunctive query with join predicates P . For a join predicate p ∈ P ,

we denote by R(p) the relations whose attributes are mentioned in p.

Definition 3.2.6 The directed join graph of a conjunctive query with join pred-
icates P is a triple G = (V, Ep , Ev ), where V is the set of nodes and Ep and
Ev are sets of directed edges defined as follows. For any two nodes u, v ∈ V , if
R(u) ∩ R(v) 6= ∅ then (u, v) ∈ Ep and (v, u) ∈ Ep . If R(u) ∩ R(v) = ∅, then
(u, v) ∈ Ev and (v, u) ∈ Ev . The edges in Ep are called physical edges, those
in Ev virtual edges.

Note that in G for every two nodes u, v, there is an edge (u, v) that is either
physical or virtual. Hence, G is a clique.
Let us see how we can derive a join tree from a spanning tree of a directed
join graph. Figure 3.4 I) gives a simple query graph Q corresponding to a chain
and Part II) presents Q’s directed join graph. Physical edges are drawn by
solid arrows, virtual edges by dotted arrows. Let us first consider the spanning
tree shown in Part III a). It says that we first execute R1 Bp1,2 R2 . The next
join predicate to evaluate is p2,3 . Obviously, it does not make much sense to
execute R2 Bp2,3 R3 , since R1 and R2 have already been joined. Hence, we
replace R2 in the second join by the result of the first join. This results in the
join tree (R1 Bp1,2 R2 ) Bp2,3 R3 . For the same reason, we proceed by joining
this result with R4 . The final join tree is shown in Part III b). Part IV a)
shows another spanning tree. The two joins R1 Bp1,2 R2 and R3 Bp3,4 R4 can
be executed independently and do not influence each other. Next, we have to
consider p2,3 . Both R2 and R3 have already been joined. Hence, the last join
processes both intermediate results. The final join tree is shown in Part IV b).
The spanning tree shown in Part V a) results in the same join tree shown in
Part V b). Hence, two different spanning trees can result in the same join tree.
However, the spanning tree in Part IV a) is more specific in that it demands
R1 Bp1,2 R2 to be executed before R3 Bp3,4 .
Next, take a look at Figure 3.5. Part I), II), and III a) show a query graph,
its directed join tree and a spanning tree. To build a join tree from the spanning
tree we proceed as follows. We have to execute R2 Bp2,3 R3 and R3 B R4 first.
In which way we do so is not really fixed by the spanning tree. So let us do
both in parallel. Next is p1,2 . The only dependency the spanning tree gives
us is that it should be executed after p3,4 . Since there is no common relation
between those two, we perform R1 Bp1,2 R2 . Last is p4,5 . Since we find p3,4
below it, we use the intermediate result produced by it as a replacement for R4 .
The result is shown in Part III b). It has three loose ends. Additional joins are
required to tie the partial results together. Obviously, this is not what we want.
A spanning tree that avoids this problem of additional joins is called effective.
It can be shown that a spanning tree T = (V, E) is effective if it satisfies the
following conditions [530]:

1. T is a binary tree,

2. for all inner nodes v and node u with (u, v) ∈ E it holds that R∗ (T (u)) ∩
R(v) 6= ∅, and
58 CHAPTER 3. JOIN ORDERING

I R1 R2 R3 R4 R5

II p1,2 p2,3 p3,4 p4,5

III a) p4,5 III b) B

?
B B B
R5
p2,3 p1,2
R2 R3 R3 R4 R1 R2

p3,4

Figure 3.5: A query graph, its directed join tree, a spanning tree and its problem

3. for all nodes v, u1 , u2 with u1 6= u2 , (u1 , v) ∈ E, and (u2 , v) ∈ E one of

the following two conditions holds:

(a) ((R∗ (T (u1 )) ∩ R(v)) ∩ (R∗ (T (u2 )) ∩ R(v))) = ∅ or

(b) (R∗ (T (u1 )) ∩ R(v) = R(v)) ∨ (R∗ (T (u2 )) ∩ R(v) = R(v)).

Thereby, we denote by T (v) the partial tree rooted at v and by R∗ (T 0 ) =

∪v∈T 0 R(v) the set of all relations in subtree T 0 .
We see that the spanning tree in Figure 3.5 III a) is ineffective since, for
example, R(p2,3 ) ∩ R(p4,5 ) = ∅. The spanning tree in Figure 3.4 IV a) is also
ineffective. During the algorithm we will take care—by checking the above
conditions—that only effective spanning trees are generated.
We now assign weights to the edges of the directed join graph. For two
nodes v, u ∈ V define u u v := R(u) ∩ R(v). For simplicity, we assume that
every predicate involves exactly two relations. Then for all u, v ∈ V , u u v
contains a single relation. Let v ∈ V be a node with R(v) = {Ri , Rj }. We
abbreviate Ri Bv Rj by Bv . Using these notations, we can attach weights to
the edges to define the weighted directed join graph.

Definition 3.2.7 Let G = (V, Ep , Ev ) be a directed join graph for a conjunctive

query with join predicates P . The weighted directed join graph is derived from
G by attaching a weight to each edge as follows:

• Let (u, v) ∈ Ep be a physical edge. The weight wu,v of (u, v) is defined as

| Bu |
wu,v = .
|u u v|

• For virtual edges (u, v) ∈ Ev , we define wu,v = 1.

3.2. DETERMINISTIC ALGORITHMS 59

(Lee, Shih, and Chen actually attach two weights to each edge: one additional
weight for the size of the tuples (in bytes) [530].)
The weights of physical edges are equal to the si of the dependency graph
used in the IKKBZ-Algorithm (Section 3.2.2). To see this, assume R(u) =
{R1 , R2 }, R(v) = {R2 , R3 }. Then

| Bu |
wu,v =
|u u v|
|R1 Bu R2 |
=
|R2 |
f1,2 |R1 | |R2 |
=
|R2 |
= f1,2 |R1 |

Hence, if the join R1 Bu R2 is executed before the join R2 Bv R3 , the input size
to the latter join changes by a factor wu,v . This way, the influence of a join
on another join is captured by the weights. Since those nodes connected by a
virtual edge do not influence each other, a weight of 1 is appropriate.
Additionally, we assign weights to the nodes of the directed join graph.
The weight of a node reflects the change in cardinality to be expected when
certain other joins have been executed before. They are specified by a (partial)
spanning tree S. Given S, we denote by BSpi,j the result of the join Bpi,j if all
joins preceding pi,j in S have been executed. Then the weight attached to node
pi,j is defined as
| BSpi,j |
w(pi,j , S) = .
|Ri Bpi,j Rj |
For empty sequences , we define w(pi,j , ) = |Ri Bpi,j Rj |. Similarly, we define
the cost of a node pi,j depending on other joins preceding it in some given
spanning tree S. We denote this by cost(pi,j , S). The actual cost function can
be one we have introduced so far or any other one. In fact, if we have a choice
of several join implementations, we can take the minimum over all their cost
functions. This then choses the most effective join implementation.
The maximum value precedence algorithm works in two phases. In the first
phase, it searches for edges with a weight smaller than one. Among these, the
one with the biggest impact is chosen. This one is then added to the spanning
tree. In other words, in this phase, the costs of expensive joins are minimized by
making sure that (size) decreasing joins are executed first. The second phase
adds edges such that the intermediate result sizes increase as little as possible.

MVP(G)
Input: a weighted directed join graph G = (V, Ep , Ev )
Output: an effective spanning tree
Q1 .insert(V ); /* priority queue with largest node weights w(·) first */
Q2 = ∅; /* priority queue with smallest node weights w(·) first */
G0 = (V 0 , E 0 ) with V 0 = V and E 0 = Ep ; /* working graph */
60 CHAPTER 3. JOIN ORDERING

S = (VS , ES ) with VS = V and ES = ∅; /* resulting effective spanning tree */

while (!Q1 .empty() && |ES | < |V | − 1) { /* Phase I */
v = Q1 .head();
among all (u, v) ∈ E 0 , wu,v < 1 such that
S 0 = (V, ES0 ) with ES0 = ES ∪ {(u, v)} is acyclic and effective
select one that maximizes cost(Bv , S) - cost(Bv , S 0 );
if (no such edge exists) {
Q1 .remove(v);
Q2 .insert(v);
continue;
}
MvpUpdate((u, v));
recompute w(·) for v and its ancestors; /* rearranges Q1 */
}
while (!Q2 .empty() && |ES | < |V | − 1) { /* Phase II */
v = Q2 .head();
among all (u, v), (v, u) ∈ E 0 denoted by (x, y) henceforth
such that
S 0 = (V, ES0 ) with ES0 = ES ∪ {(x, y)} is acyclic and effective
select the one that minimizes cost(Bv , S 0 ) - cost(Bv , S);
MvpUpdate((x, y));
recompute w(·) for y and its ancestors; /* rearranges Q2 */
}
return S;

MvpUpdate((u, v))
Input: an edge to be added to S
Output: side-effects on S, G0 ,
ES ∪ = {(u, v)};
E 0 \ = {(u, v), (v, u)};
E 0 \ = {(u, w)|(u, w) ∈ E 0 }; /* (1) */
E 0 ∪ = {(v, w)|(u, w) ∈ Ep , (v, w) ∈ Ev }; /* (3) */
if (v has two inflowing edges in S) { /* (2) */
E 0 \ = {(w, v)|(w, v) ∈ E 0 };
}
if (v has one outflowing edge in S) { /* (1) in paper but not needed */
E 0 \ = {(v, w)|(v, w) ∈ E 0 };
}

Note that in order to test for the effectiveness of a spanning tree in the
algorithm, we just have to check the conditions for the node the selected edge
leads to.
MvpUpdate first adds the selected edge to the spanning tree. It then elim-
inates edges that need not to be considered for building an effective spanning
tree. Since (u, v) has been added, both (u, v) and (v, u) do not have to be
considered any longer. Also, since effective spanning trees are binary trees, (1)
3.2. DETERMINISTIC ALGORITHMS 61

every node must have only one parent node and (2) at most two child nodes.
The edges leading to a violation are eliminated by MvpUpdate in the lines com-
mented with the corresponding numbers. For the line commented (3) we have
the situation that u → v 99K w and u → w in G. This means that u and w have
common relations, but v and w do not. Hence, the result of performing v on
the result of u will have a common relation with w. Thus, we add a (physical)
edge v → w.

3.2.4 Dynamic Programming

Algorithms
Consider the two join trees

(((R1 B R2 ) B R3 ) B R4 ) B R5

and
(((R3 B R1 ) B R2 ) B R4 ) B R5 .
If we know that ((R1 BR2 )BR3 ) is cheaper than ((R3 BR1 )BR2 ), we know that
the first join tree is cheaper than the second. Hence, we could avoid generating
the second alternative and still won’t miss the optimal join tree. The general
principle behind this is the optimality principle (see [204]). For the join ordering
problem, it can be stated as follows.1

Let T be an optimal join tree for relations R1 , . . . , Rn . Then, every

subtree S of T must be an optimal join tree for the relations it
contains.

To see why this holds, assume that the optimal join tree T for relations R1 , . . . , Rn
contains a subtree S which is not optimal. That is, there exists another join
tree S 0 for the relations contained in S with strictly lower costs. Denote by
T 0 the join tree derived by replacing S in T by S 0 . Since S 0 contains the same
relations as S, T 0 is a join tree for the relations R1 , . . . , Rn . The costs of the join
operators in T and T 0 that are not contained in S and S 0 are the same. Then,
since the total cost of a join tree is the sum of the costs of the join operators
and S 0 has lower costs than S, T 0 has lower costs than T . This contradicts the
optimality of T .
The idea of dynamic programming applied to the generation of optimal join
trees now is to generate optimal join trees for subsets of R1 , . . . , Rn in a bottom-
up fashion. First, optimal join trees for subsets of size one, i.e. single relations,
are generated. From these, optimal join trees of size two, three and so on until
n are generated.
Let us first consider generating optimal left-deep trees. There, join trees for
subsets of size k are generated from subsets of size k − 1 by adding a new join
operator whose left argument is a join tree for k − 1 relations and whose right
argument is a single relation. Exchanging left and right gives us the procedure
for generating right-deep trees. If we want to generate zig-zag trees since our
1
The optimality principle does not hold in the presence of properties.
62 CHAPTER 3. JOIN ORDERING

cost function is asymmetric, we have to consider both alternatives and take

the cheapest one. We capture this in a procedure CreateJoinTree that takes
two join trees as arguments and generates the above-mentioned alternatives.
In case we want to consider different implementations for the join, we have to
perform the above steps for all of them and return the cheapest alternative.
Summarizing, the pseudo-code for CreateJoinTree looks as follows:

CreateJoinTree(T1 , T2 )
Input: two (optimal) join trees T1 and T2 .
for linear trees, we assume that T2 is a single relation
Output: an (optimal) join tree for joining T1 and T2 .
BestTree = NULL;
for all implementations impl do {
if(!RightDeepOnly) {
Tree = T1 Bimpl T2
if (BestTree == NULL || cost(BestTree) > cost(Tree)) {
BestTree = Tree;
}
}
if(!LeftDeepOnly) {
Tree = T2 Bimpl T1
if (BestTree == NULL || cost(BestTree) > cost(Tree)) {
BestTree = Tree;
}
}
}
return BestTree;

The boolean variables RightDeepOnly and LeftDeepOnly are used to restrict

the search space to right-deep trees and left-deep trees. If both are false, zig-zag
trees are generated. However, CreateJoinTree also generates bushy trees, if
none of the input trees is a single relation.
In case of linear trees, T2 will be the single relation in all of our algorithms.
CreateJoinTree should not copy T1 or T2 . Instead, the newly generated join
trees should share T1 and T2 by using pointers. Further, the join trees generated
do not really need to be generated except for the final (best) join tree: the cost
functions should be implemented such that they can be evaluated if they are
given the left and right argument of the join.
Using CreateJoinTree, we are now ready to present our first dynamic pro-
gramming algorithm in pseudo-code.

DP-Linear-1({R1 , . . . , Rn })
Input: a set of relations to be joined
Output: an optimal left-deep (right-deep, zig-zag) join tree
3.2. DETERMINISTIC ALGORITHMS 63

{R1 R2 R3 R4}

{R1 R2 R4} {R1 R3 R4}

{R1 R2 R3} {R2 R3 R4}

{R1 R4}
{R1 R3} {R2 R4}
{R1 R2} {R3 R4}
{R2 R3}

R1 R2 R3 R4

Figure 3.6: Search space with sharing under optimality principle

for (i = 1; i <= n; ++i) {

BestTree({Ri }) = Ri ;
}
for (i = 1; i < n; ++i) {
for all S ⊆ {R1 , . . . , Rn }, |S| = i do {
for all Rj ∈ {R1 , . . . , Rn }, Rj 6∈ S do {
if (NoCrossProducts && !connected({Rj }, S)) {
continue;
}
CurrTree = CreateJoinTree(BestTree(S),Rj );
S 0 = S ∪ {Rj };
if (BestTree(S 0 ) == NULL || cost(BestTree(S 0 )) > cost(CurrTree)) {
BestTree(S 0 ) = CurrTree;
}
}
}
}
return BestTree({R1 , . . . , Rn });
NoCrossProducts is a boolean variable indicating whether cross products should
be investigated. Of course, if the join graph is not connected, there must be
64 CHAPTER 3. JOIN ORDERING

a cross product, but for DP-Linear-1 and subsequent algorithms we assume

that it is connected. The boolean function connected returns true, if there is
a join predicate between one of the relations in its first argument and one of
the relations in its second. The variable BestTree keeps track of the best join
trees generated for every subset of the relations {R1 , . . . , Rn }. How this is done
may depend on several parameters. The approaches are to use a hash table or
XC search an array of size 2n (−1). Another issue is how to represent the sets of relations.
space size Typically, bitvector representations are used. Then, testing for membership,
difference computing a set’s complement, adding elements and unioning is cheap. Yet
problem another issue is the order in which join trees are generated. The procedure
DP-Linear-1 takes the approach to generate the join trees for subsets of size
1, 2, . . . , n. To do so, it must be able to access the subsets of {R1 , . . . , Rn } or
their respective join trees by their size. One possibility is to chain all the join
trees for subsets of a given size k (1 ≤ k ≤ n) and to use an array of size n to
keep pointers to the start of the lists. In this case, to every join tree the set of
relations it contains is attached, in order to be able to perform the test Ri 6∈ S.
One way to do this is to embed a bitvector into each join tree node.
Figure 3.6 illustrates how the procedure DP-Linear-1 works. In its first
loop, it initializes the bottom row of join trees of size one. Then it computes
the join trees joining exactly two relations. This is indicated by the next group
of join trees. Since the figure leaves out commutativity, only one alternative
join tree for every subset of size two is generated. This changes for subsets of
size three. There, three alternative join trees are generated. Only the best join
tree is retained. This is indicated by the ovals that encircle three join trees.
Only this best join tree of size three is used to generate the final best join tree.
The short clarification after the algorithm already adumbrated that the
order in which join trees are generated is not compulsory. The only necessary
condition is the following.

Let S be a subset of {R1 , . . . , Rn }. Then, before a join tree for S

can be generated, the join trees for all relevant subsets of S must
already be available.

Note that this formulation is general enough to also capture the generation of
bushy trees. It is, however, a little vague due to its reference to “relevance”.
EX For the different join tree classes, this term can be given a precise semantics.
Let us take a look at an alternative order to join tree generation. Assume
that sets of relations are represented as bitvectors. A bitvector is nothing more
than a base two integer. Successive increments of an integer/bitvector lead to
different subsets. Further, the above condition is satisfied. We illustrate this by
a small example. Assume that we have three relations R1 , R2 , R3 . The i-th bit
from the right in a three-bit integer indicates the presence of Ri for 1 ≤ i ≤ 3.
3.2. DETERMINISTIC ALGORITHMS 65

000 {}
001 {R1 }
010 {R2 }
011 {R1 , R2 }
100 {R3 }
101 {R1 , R3 }
110 {R2 , R3 }
111 {R1 , R2 , R3 }

This observation leads to another formulation of our dynamic programming

algorithm. For this algorithm, it is very convenient to use an array of size 2n
to represent BestTree(S) for subsets S of {R1 , . . . , Rn }.

DP-Linear-2({R1 , . . . , Rn })
Input: a set of relations to be joined
Output: an optimal left-deep (right-deep, zig-zag) join tree
for (i = 1; i <= n; ++i) {
BestTree(1 << i − 1) = Ri ;
}
for (S = 1; S < 2n ; ++S) {
if (BestTree(S) != NULL) continue;
for all i ∈ S do {
S 0 = S \ {i};
CurrTree = CreateJoinTree(BestTree(S 0 ),Ri );
if (BestTree(S) == NULL || cost(BestTree(S)) > cost(CurrTree)) {
BestTree(S) = CurrTree;
}
}
}
return BestTree(2n − 1);

DP-Linear-2 differs from DP-Linear-1 not only in the order in which join trees
are generated. Another difference is that it takes cross products into account.
From DP-Linear-2, it is easy to derive an algorithm that explores the space
of bushy trees.

DP-Bushy({R1 , . . . , Rn })
Input: a set of relations to be joined
Output: an optimal bushy join tree
for (i = 1; i <= n; ++i) {
BestTree(1 << i − 1) = Ri ;
}
for (S = 1; S < 2n ; ++S) {
if (BestTree(S) != NULL) continue;
for all S1 ⊂ S do {
66 CHAPTER 3. JOIN ORDERING

S2 = S \ S1 ;
CurrTree = CreateJoinTree(BestTree(S1 ), BestTree(S2 ));
if (BestTree(S) == NULL || cost(BestTree(S)) > cost(CurrTree)) {
BestTree(S) = CurrTree;
}
}
}
return BestTree(2n − 1);

This algorithm also takes cross products into account. The critical part is the
generation of all subsets of S. Fortunately, Vance and Maier [885] provide a
code fragment with which subset bitvector representations can be generated
very efficiently. In C, this fragment looks as follows:

S1 = S & - S;
do {
/* do something with subset S1 */
S1 = S & (S1 - S);
} while (S1 != S);

S represents the input set. S1 iterates through all subsets of S where S itself and
the empty set are not considered. Analogously, all supersets an be generated
as follows:

S1 = ~S & - ~S;
/* do something with first superset S1 */
while (S1 ) {
S1 = ~S & (S1 - ~S)
/* do something with superset S1
}

S represents the input set. S1 iterates through all supersets of S including S

itself.

Excursion Problem: exploiting orderings devastates the optimality principle.

XC ToDo Example: . . .

XC ToDo Excursion Pruning . . .

Number of Entries to be stored in the dynamic programming table

If dynamic programming uses a static hash table, determining its size in advance
is necessary as the search space sizes differ vastly for different query graphs.
In general, for every connected subgraph of the query graph one entry must
3.2. DETERMINISTIC ALGORITHMS 67

exist. Chains require far fewer entries than cliques. It would be helpful to
have a small routine solving the following problem: given a query graph, how
many connected subgraph are there? Unfortunatly, this problem is #-P hard
as Sutner, Satyanarayana, and Suffel showed [843]. They build on results by
Valiant [883] and Lichtenstein [546]. (For a definition of #P-hard see the book
by Lewis and Papadimitriou [544] or the original paper by Valiant [882].)
However, for specific cases, these numbers can be given. If cross products
are consideres, the number of join trees stored in the dynamic programming
table is
2n − 1
which is one for each non-empty subset of relations.
If we do not consider cross products, the number of entries in the dynamic
programming table corresponds to the number of connected subgraphs of the
query graph. For connected query graphs, we denote this by #csg. For chains,
cycles, stars, and cliques with n nodes, we have

n(n + 1)
#csgchain (n) = (3.2)
2
#csgcycle (n) = n2 − n + 1 (3.3)
star n−1
#csg (n) = 2 +n−1 (3.4)
clique n
#csg (n) = 2 − 1 (3.5)

These equations can be derived from the following by summing over k > 1
where k gives the size of the connected subset:

#csgchain (n, k) = (n − k + 1)

1 n=k
#csgcycle (n, k) =
n else

n k=1
#csgstar (n, k) = n−1
k>1
k−1
n
#csgclique (n, k) =
k

Number of Join Trees Investigated

The number of join trees investigated by dynamic programming was extensively
studied by Ono and Lohman [639, 640]. In order to estimate these numbers, we
assume that CreateJoinTree produces a single join tree and hence counts as
one although it may evaluate the costs for several join alternatives. We further
do not count the initial join trees containing only a single relation.

Join Trees With Cartesian Product For the analysis of dynamic pro-
gramming variants that do consider cross products, the notion of join-pair is
helpful. Let S1 and S2 be subsets of the nodes (relations) of the query graph.
We say (S1 , S2 ) is a join-pair, if and only if
68 CHAPTER 3. JOIN ORDERING

1. S1 and S2 are disjoint

If (S1 , S2 ) is a join-pair, then (S2 , S1 ) is a join pair. Further, if T1 is a join tree
for the relations in S1 and T2 is one for those in S2 , then we can construct two
valid join trees T1 1 T2 and T2 1 T1 where the joins may be cross products.
Hence, the number of join-pairs coincides with the search space a dynamic pro-
gramming algorithm explores. In fact, the number of join-pairs is the minimum
number of join trees any dynamic programming algorithm that considers cross
products has to investigate.
If CreateJoinTree considers commutativity of joins, the number of calls
to it is precisely expressed by the count of non-symmetric join-pairs. In other
implementations CreateJoinTree might be called for all join-pairs and, thus,
may not consider commutativity. The two formulas below only count non-
symmetric join pairs.
The numbers of linear and bushy join trees with cartesian product is easiest
to determine. They are independent of the query graph. For linear join trees,
the number of join trees investigated by dynamic programming is equal to the
number of non-symmetric join-pairs which is
n(n + 1)
n2n−1 −
2
Dynamic programming investigates the following number of bushy trees if
cross products are considered.

(3n − 2n+1 + 1)
2
This is equal to the number of non-symmetric join-pairs.

Join Trees without Cross Products In this paragraph, we assume that the
query graph is connected. For the analysis of dynamic programming variants
that do not consider cross products, it is helpful to have the notion of a csg-
cmp-pair. Let S1 and S2 be subsets of the nodes (relations) of the query graph.
We say (S1 , S2 ) is a csg-cmp-pair , if and only if
1. S1 induces a connected subgraph of the query graph,

2. S2 induces a connected subgraph of the query graph,

3. S1 and S2 are disjoint, and

4. there exists at least one edge connected a node in S1 to a node in S2 .

If (S1 , S2 ) is a csg-cmp-pair, then (S2 , S1 ) is a valid csg-cmp-pair. Further, if T1
is a join tree for the relations in S1 and T2 is one for those in S2 , then we can
construct two valid join trees T1 1 T2 and T2 1 T1 . Hence, the number of csg-
cmp-pairs coincides with the search space a dynamic programming algorithm
explores. In fact, the number of csg-cmp-pairs is the minimum number of
join trees any dynamic programming algorithm that does not consider cross
products has to investigate.
3.2. DETERMINISTIC ALGORITHMS 69

If CreateJoinTree considers commutativity of joins, the number of calls

to it is precisely expressed by the count of non-symmetric csg-cmp-pairs. In
other implementations CreateJoinTree might be called for all csg-cmp-pairs
and, thus, may not consider commutativity.
Let us denote the number of non-symmetric csg-cmp-pairs by #ccp. Then
1 3
#ccpchain (n) = (n − n)
6
#ccpcycle (n) = (n3 − 2n2 + n)/2
#ccpstar (n) = (n − 1)2n−2
#ccpclique (n) = (3n − 2n+1 + 1)/2

These numbers have to be multiplied by two if we want to count all csg-cmp-

pairs.
If we do not consider composite inners, that is we restrict ourselves to left-
deep join trees, then dynamic programming makes the following number of calls
to CreateJoinTree for chain queries [640]:

(n − 1)2

The following table presents some results for the above formulas.

without cross products with cross products

chain star any query graph
linear bushy linear linear bushy
n (n − 1)2 (n3 − n)/6 (n − 1)2n−2 n2n−1 − n(n + 1)/2 (3n − 2n+1 + 1)/2
2 1 1 1 1 1
3 4 4 4 6 6
4 9 10 12 22 25
5 16 20 32 65 90
6 25 35 80 171 301
7 36 56 192 420 966
8 49 84 448 988 3025
9 64 120 1024 2259 9330
10 81 165 2304 5065 28501

Compare this table with the actual sizes of the search spaces in Section 3.1.5.
The dynamic programming algorithms can be implemented very efficiently
and often form the core of commercial plan generators. However, they have
the disadvantage that no plan is generated if they run out of time or space
since the search space they have to explore is too big. One possible remedy
goes as follows. Assume that a dynamic programming algorithm is stopped
in the middle of its way through its actual search space. Further assume that
the largest plans generated so far involve k relations. Then the cheapest of the
plans with k relations is completed by applying any heuristics (e.g. MinSel). The
completed plan is then returned. In Section 3.4.5, we will see two alternative
solutions. Another solution is presented in [480].
70 CHAPTER 3. JOIN ORDERING

DPsize
Input: a connected query graph with relations R = {R0 , . . . , Rn−1 }
Output: an optimal bushy join tree without cross products
for all Ri ∈ R {
BestPlan({Ri }) = Ri ;
}
for all 1 < s ≤ n ascending // size of plan
for all 1 ≤ s1 ≤ s/2 { // size of left/right subplan
s2 = s − s1 ; // size of right/left subplan
for all S1 ⊂ R in BestPlan with |S1 | = s1
S2 ⊂ R in BestPlan with |S2 | = s2 {
++InnerCounter;
6 S1 ∩ S2 ) continue;
if (∅ =
if not (S1 connected to S2 ) continue;
++CsgCmpPairCounter;
p1 =BestPlan(S1 );
p2 =BestPlan(S2 );
CurrPlan = CreateJoinTree(p1 , p2 );
if (cost(BestPlan(S1 ∪ S2 )) > cost(CurrPlan)) {
BestPlan(S1 ∪ S2 ) = CurrPlan;
}
}
}
OnoLohmanCounter = CsgCmpPairCounter / 2;
return BestPlan({R0 , . . . , Rn−1 });

Figure 3.7: Algorithm DPsize

Generating Bushy Trees without Cross Products

We now discuss dynamic programming algorithms to generate bushy trees with-

out cross products. For this section, we assume that the query graph is con-
nected. We will present three algorithms. The first algorithm (DPsize) gener-
ates its plans in increasing size of subplans and, hence, is a generalization of
DP-Linear-1. The second algorithm (DPsub) geneerates its plans by consider-
ing plans subsets as does DP-Linear-2. An analysis of these two algorithms
reveals that both are far away from the lower bound presented in the previous
sections. Thus, a third algorithm (DPccp) which reaches this lower bound is
presented. The results of this section are taken from [607, 605].

Size-based enumeration: DPsize In general, dynamic programming gen-

erates solutions for a larger problem in a bottom-up fashion by combining so-
lutions for smaller problems. Taking this description literally, we can construct
optimal plans of size n by joining plans p1 and p2 of size k and n − k. We just
have to take care that (1) the sets of relations contained in p1 and p2 do not
overlap, and (2) there is a join predicate connecting a relation p1 with a rela-
tion in p2 . After this remark, we are prepared to understand the pseudocode
3.2. DETERMINISTIC ALGORITHMS 71

for algorithm DPsize (see Fig. 3.7). A table BestPlan associates with each
set of relations the best plan found so far. The algorithm starts by initializing
this table with plans of size one, i.e. single relations. After that, it constructs
plans of increasing size (loop over s). Thereby, the first size considered is two,
since plans of size one have already been constructed. Every plan joining n
relations can be constructed by joining a plan containing s1 relations with a
plan containing s2 relations. Thereby, si > 0 and s1 + s2 = n must hold. Thus,
the pseudocode loops over s1 and sets s2 accordingly. Since for every possible
size there exist many plans, two more loops are necessary in order to loop over
the plans of sizes s1 and s2 . (This is best implemented by keeping list heads for
every possible plan size pointing to a first plan of this size and chaining plans
of equal size via some next-pointer.) Then, conditions (1) and (2) from above
are tested. Only if their outcome is positive, we consider joining the plans p1
and p2 . The result is a plan CurrPlan. Let S be the relations contained in
CurrPlan. If BestPlan does not contain a plan for the relations in S or the
one it contains is more expensive than CurrPlan, we register CurrPlan with
BestPlan.
The algorithm DPsize can be made more efficient in case of s1 = s2 . The
algorithm as stated cycles through all plans p1 joining s1 relations. For each
such plan, all plans p2 of size s2 are tested. Assume that plans of equal size are
represented as a linked list. If s1 = s2 , then it is possible to iterate through the
list for retrieving all plans p1 . For p2 we consider the plans succeeding p1 in
the list. Thus, the complexity can be decreased from P (s1 ) ∗ P (s2 ) to P (s1 ) ∗
P (s2 )/2, where P (si ) denotes the number of plans of size si . The following
formulas are valid only for the variant of DPsize where this optimization has
been incorporated (see [605] for details).
If the counter InnerCounter is initialized with zero at the beginning of the
algorithm DPsize, then we are able to derive analytically its value after DPsize
terminates. Since this value of the inner counter depends on the query graph,
we have to distinguish several cases. For chain, cycle, star, and clique queries,
chain , I cycle , I star , and I clique the value of InnerCounter
we denote by IDPsize DPsize DPsize DPsize
after termination of algorithm DPsize.
chain (n) =
For chain queries, we then have: IDPsize

1/48(5n4 + 6n3 − 14n2 − 12n) n even
4 3 2
1/48(5n + 6n − 14n − 6n + 11) n odd

cycle
For cycle queries, we have: IDPsize (n) =
1 4
4 (n − n3 − n2 ) n even
1 4 3 2
4 (n − n − n + n) n odd

star (n) =
For star queries, we have: IDPsize
(
2(n−1)
22n−4 − 1/4 n−1 + q(n) n even
2(n−1) n−1

22n−4 − 1/4 n−1 + 1/4 (n−1)/2 + q(n) n odd
72 CHAPTER 3. JOIN ORDERING

DPsub
Input: a connected query graph with relations R = {R0 , . . . , Rn−1 }
Output: an optimal bushy join tree
for all Ri ∈ R {
BestPlan({Ri }) = Ri ;
}
for 1 ≤ i < 2n − 1 ascending {
S = {Rj ∈ R|(bi/2j c mod 2) = 1}
if not (connected S) continue; // ∗
for all S1 ⊂ S, S1 6= ∅ do {
++InnerCounter;
S2 = S \ S1 ;
if (S2 = ∅) continue;
if not (connected S1 ) continue;
if not (connected S2 ) continue;
if not (S1 connected to S2 ) continue;
++CsgCmpPairCounter;
p1 = BestPlan(S1 );
p2 = BestPlan(S2 );
CurrPlan = CreateJoinTree(p1 , p2 );
if (cost(BestPlan(S)) > cost(CurrPlan)) {
BestPlan(S) = CurrPlan;
}
}
}
OnoLohmanCounter = CsgCmpPairCounter / 2;
return BestPlan({R0 , . . . , Rn−1 });

Figure 3.8: Algorithm DPsub

with q(n) = n2n−1 − 5 ∗ 2n−3 + 1/2(n2 − 5n + 4). For clique queries, we have:
clique
IDPsize (n) =
(
22n−2 − 5 ∗ 2n−2 + 1/4 2n n
n − 1/4 n/2 + 1 n even
22n−2 − 5 ∗ 2n−2 + 1/4 2n
n +1 n odd
n √
Note that 2nn is in the order of Θ(4 / n).
Proofs of the above formulas as well as implementation details for the algo-
rithm DPsize can be found in [605].

Subset-Driven Enumeration: DPsub Fig. 3.8 presents the pseudocode for

the algorithm DPsub. The algorithm first initializes the table BestPlan with
all possible plans containing a single relation. Then, the main loop starts. It
iterates over all possible non-empty subsets of {R0 , . . . , Rn−1 } and constructs
the best possible plan for each of them. The enumeration makes use of a
bitvector representation of sets: The integer i induces the current subset S with
its binary representation. Taken as bitvectors, the integers in the range from 1
3.2. DETERMINISTIC ALGORITHMS 73

to 2n − 1 exactly represent the set of all non-empty subsets of {R0 , . . . , Rn−1 },

including the set itself. Further, by starting with 1 and incrementing by 1, the
enumeration order is valid for dynamic programming: for every subset, all its
subsets are generated before the subset itself.
This enumeration is very fast, since increment by one is a very fast operation.
However, the relations contained in S may not induce a connected subgraph of
the query graph. Therefore, we must test for connectedness. The goal of the
next loop over all subsets of S is to find the best plan joining all the relations
in S. Therefore, S1 ranges over all non-empty, strict subsets of S. This can be
done very efficiently by applying the code snippet of Vance and Maier [884, 885].
Then, the subset of relations contained in S but not in S1 is assigned to S2 .
Clearly, S1 and S2 are disjoint. Hence, only connectedness tests have to be
performed. Since we want to avoid cross products, S1 and S2 both must induce
connected subgraphs of the query graph, and there must be a join predicate
between a relation in S1 and one in S2 . If these conditions are fulfilled, we can
construct a plan CurrPlan by joining the plans associated with S1 and S2 . If
BestPlan does not contain a plan for the relations in S or the one it contains
is more expensive than CurrPlan, we register CurrPlan with BestPlan.
For chain, cycle, star, and clique queries, we denote by IDPsub chain , I cycle ,
DPsub
star , and I clique the value of InnerCounter after termination of algorithm
IDPsub DPsub
DPsub.
For chains, we have
chain
IDPsub (n) = 2n+2 − nn − 3n − 4 (3.6)

For cycles, we have

cycle
IDPsub (n) = n2n + 2n − 2n2 − 2 (3.7)

For stars, we have

star
IDPsub (n) = 2 ∗ 3n−1 − 2n (3.8)
For cliques, we have
clique
IDPsub (n) = 3n − 2n+1 + 1 (3.9)

The number of failures for the additional check can easily be calculated as
2n − #csg(n) − 1.

Sample numbers Fig. 3.9 contains tables with values produced by our for-
mulas for input query graph sizes between 2 and 20. For different kinds of query
graphs, it shows the number of csg-cmp-pairs (#ccp). and the values for the
inner counter after termination of DPsize and DPsub applied to the different
query graphs.
Looking at these numbers, we observe the following:

• For chain and cycle queries, the DPsize soon becomes much faster than
DPsub.
74 CHAPTER 3. JOIN ORDERING

Chain Cycle
n #ccp/2 DPsub DPsize #ccp/2 DPsub DPsize
2 1 2 1 1 2 1
5 20 84 73 40 140 120
10 165 3962 1135 405 11062 2225
15 560 130798 5628 1470 523836 11760
20 1330 4193840 17545 3610 22019294 37900
Star Clique
n #ccp/2 DPsub DPsize #ccp/2 DPsub DPsize
2 1 2 1 1 2 1
5 32 130 110 90 180 280
10 2304 38342 57888 28501 57002 306991
15 114688 9533170 57305929 7141686 14283372 307173877
20 4980736 2323474358 59892991338 1742343625 3484687250 309338182241

Figure 3.9: Size of the search space for different graph structures

• For star and clique queries, the DPsub soon becomes much faster than
DPsize.

• Except for clique queries, the number of csg-cmp-pairs is orders of mag-

nitude less than the value of InnerCounter for all DP-variants.

From the latter observation we can conclude that in almost all cases the tests
performed by both algorithms in their innermost loop fail. Both algorithms
are far away from the theoretical lower bound given by #ccp. This conclusion
motivates us to derive a new algorithm whose InnerCounter value is equal to
the number of csg-cmp-pairs.

Csg-cmp-pair enumeration-based algorithm: DPccp The algorithm DPsub

solves the join ordering problem for a given subset S of relations by considering
all pairs of disjoint subproblems which were already solved. Since the enumer-
ation of subsets is very fast, this is a very efficient strategy if the search space
is dense, e.g. for clique queries. However, if the search space is sparse, e.g. for
chain queries, the DPsub algorithm considers many subproblems which are not
connected and, therefore, are not relevant for the solution, i.e. the tests in the
innermost loop fail for the majority of cases. The main idea of our algorithm
DPccp is that it only considers pairs of connected subproblems. More precisely,
the algorithm considers exactly the csg-cmp-pairs of a graph.
Thus, our goal is to efficiently enumerate all csg-cmp-pairs (S1 , S2 ). Clearly,
we want to enumerate every pair once and only once. Further, the enumera-
tion must be performed in an order valid for dynamic programming. That is,
whenever a pair (S1 , S2 ) is generated, all non-empty subsets of S1 and S2 must
have been generated before as a component of a pair. The last requirement is
that the overhead for generating a single csg-cmp-pair must be constant or at
most linear. This condition is necessary in order to beat DPsize and DPsub.
3.2. DETERMINISTIC ALGORITHMS 75

DPccp
Input: a connected query graph with relations R = {R0 , . . . , Rn−1 }
Output: an optimal bushy join tree
for all Ri ∈ R) {
BestPlan({Ri }) = Ri ;
}
for all csg-cmp-pairs (S1 , S2 ), S = S1 ∪ S2 {
++InnerCounter;
++OnoLohmanCounter;
p1 = BestPlan(S1 );
p2 = BestPlan(S2 );
CurrPlan = CreateJoinTree(p1 , p2 );
if (cost(BestPlan(S)) > cost(CurrPlan)) {
BestPlan(S) = CurrPlan;
}
CurrPlan = CreateJoinTree(p2 , p1 );
if (cost(BestPlan(S)) > cost(CurrPlan)) {
BestPlan(S) = CurrPlan;
}
}
CsgCmpPairCounter = 2 * OnoLohmanCounter;
return BestPlan({R0 , . . . , Rn−1 });

Figure 3.10: Algorithm DPccp

0 1 1 1 1 0 0 0 1
...
2 3 2 3 3 2 3 2 3 2 2 3

Graph 1. 2. 3. 4. 5. 6. 7. ...

Figure 3.11: Enumeration Example for DPccp

If we meet all these requirements, the algorithm DPccp is easily specified:

iterate over all csg-cmp-pairs (S1 , S2 ) and consider joining the best plans as-
sociated with them. Figure 3.10 shows the pseudocode. The first steps of an
example enumeration are shown in Figure 3.11. Thick lines mark the connect-
ed subsets while thin lines mark possible join edges. Note that the algorithm
explicitly exploits join commutativity. This is due to our enumeration algo-
rithm developed below. If (S1 , S2 ) is a csg-cmp-pair, then either (S1 , S2 ) or
(S2 , S1 ) will be generated, but never both of them. An alternative is to modify
CreateJoinTree to take care of commutativity.
We proceed as follows. Next we discuss an algorithm enumerating non-
empty connected subsets S1 of {R0 , . . . , Rn−1 }. Then, we show how to enumer-
ate the complements S2 such that (S1 , S2 ) is a csg-cmp-pair.
76 CHAPTER 3. JOIN ORDERING

Let us start the exposition by fixing some notations. Let G = (V, E) be

an undirected graph. For a node v ∈ V define the neighborhood IN(v) of v as
IN(v) := {v 0 |(v, v 0 ) ∈ E}. For a subset S ⊆ V of V we define the neighborhood of
S as IN(S) := ∪v∈S IN(v) \ S. The neighborhood of a set of nodes thus consists
of all nodes reachable by a single edge. Note that for all S, S 0 ⊂ V we have
IN(S ∪ S 0 ) = (IN(S) ∪ IN(S 0 )) \ (S ∪ S 0 ). This allows for an efficient bottom-up
calculation of neighborhoods.
The following statement gives a hint on how to construct an enumeration
procedure for connected subsets. Let S be a connected subset of an undirected
graph G and S 0 be any subset of IN(S). Then S ∪ S 0 is connected. As a
consequence, a connected subset can be enlarged by adding any subset of its
neighborhood.
We could generate all connected subsets as follows. For every node vi ∈ V
we perform the following enumeration steps: First, we emit {vi } as a connected
subset. Then, we expand {vi } by calling a routine that extends a given connect-
ed set to bigger connected sets. Let the routine be called with some connected
set S. It then calculates the neighborhood IN(S). For every non-empty subset
N ⊆ IN(S), it emits S 0 = S ∪ N as a further connected subset and recursively
calls itself with S 0 . The problem with this routine is that it produces duplicates.
This is the point where the breadth-first numbering comes into play. Let
V = {v0 , . . . , vn−1 }, where the indices are consistent with a breadth-first num-
bering produced by a breadth-first search starting at node v0 [205]. The idea
is to use the numbering to define an enumeration order: In order to avoid du-
plicates, the algorithm enumerates connected subgraphs for every node vi , but
restricts them to contain no vj with j < i. Using the definition Bi = {vj |j ≤ i},
the pseudocode looks as follows:
EnumerateCsg
Input: a connected query graph G = (V, E)
Precondition: nodes in V are numbered according to a breadth-first search
Output: emits all subsets of V inducing a connected subgraph of G
for all i ∈ [n − 1, . . . , 0] descending {
emit {vi };
EnumerateCsgRec(G, {vi }, Bi );
}

EnumerateCsgRec(G, S, X)
N = IN(S) \ X;
for all S 0 ⊆ N , S 0 6= ∅, enumerate subsets first {
emit (S ∪ S 0 );
}
for all S 0 ⊆ N , S 0 6= ∅, enumerate subsets first {
EnumerateCsgRec(G, (S ∪ S 0 ), (X ∪ N ));
}

Let us consider an example. Figure 3.12 contains a query graph whose

nodes are numbered in a breadth-first fashion. The calls to EnumerateCsgRec
3.2. DETERMINISTIC ALGORITHMS 77

R1 R2 R3

Figure 3.12: Sample graph to illustrate EnumerateCsgRec

EnumerateCsgRec
S X N emit/S
{4} {0, 1, 2, 3, 4} ∅
{3} {0, 1, 2, 3} {4}
{3, 4}
{2} {0, 1, 2} {3, 4}
{2, 3}
{2, 4}
{2, 3, 4}
{1} {0, 1} {4}
{1, 4}
→ {1, 4} {0, 1, 4} {2, 3}
{1, 2, 4}
{1, 3, 4}
{1, 2, 3, 4}
{0} {0} {1, 2, 3}
{0, 1}
{0, 2}
{0, 3}
{0, 1, 2}
{0, 1, 3}
{0, 2, 3}
{0, 1, 2, 3}
→ {0, 1} {0, 1, 2, 3} {4}
{0, 1, 4}
→ {0, 2} {0, 1, 2, 3} {4}
{0, 2, 4}

Figure 3.13: Call sequence for Figure 3.12

are contained in the table in Figure 3.13. In this table, S and X are the
arguments of EnumerateCsgRec. N is the local variable after its initialization.
The column emit/S contains the connected subset emitted, which then becomes
the argument of the recursive call to EnumerateCsgRec (labelled by →). Since
listing all calls is too lengthy, only a subset of the calls is listed.
Generating the connected subsets is an important first step but clearly not
78 CHAPTER 3. JOIN ORDERING

sufficient: we have to generate all csg-cmp-pairs. The basic idea to do so is

as follows. Algorithm EnumerateCsg is used to create the first component S1
of every csg-cmp-pair. Then, for each such S1 , we generate all its complement
components S2 . This can be done by calling EnumerateCsgRec with the correct
parameters. Remember that we have to generate every csg-cmp-pair once and
only once.
To achieve this, we use a similar technique as for connected subsets, using
the breadth-first numbering to define an enumeration order: we consider only
sets S2 in the complement of S1 (with (S1 , S2 ) being a csg-cmp-pair) such that
S2 contains only vj with j larger than any i with vi ∈ S1 . This avoids the
generation of duplicates.
We need some definitions to state the actual algorithm. Let S1 ⊆ V be
a non-empty subset of V . Then, we define min(S1 ) := min({i|vi ∈ S1 }).
This is used to extract the starting node from which S1 was constructed (see
Lemma ??). Let W ⊂ V be a non-empty subset of V . Then, we define
Bi (W ) := {vj |vj ∈ W, j ≤ i}. Using this notation, the algorithm to construct
all S2 for a given S1 such that (S1 , S2 ) is a csg-cmp-pair looks as follows:
EnumerateCmp
Input: a connected query graph G = (V, E), a connected subset S1
Precondition: nodes in V are numbered according to a breadth-first search
Output: emits all complements S2 for S1 such that (S1 , S2 ) is a csg-cmp-pair
X = Bmin(S1 ) ∪ S1 ;
N = IN(S1 ) \ X;
for all (vi ∈ N by descending i) {
emit {vi };
EnumerateCsgRec(G, {vi }, X ∪ (Bi ∩ N ));
}

Algorithm EnumerateCmp considers all neighbors of S1 . First, they are used to

determine those S2 that contain only a single node. Then, for each neighbor of
S1 , it recursively calls EnumerateCsgRec to create those S2 that contain more
than a single node. Note that here both nodes concerning the enumeration
of S1 (Bmin(S1 ) ∪ S1 ) and nodes concerning the enumeration of S2 (N ) have
to be considered in order to guarantee a correct enumeration. Otherwise the
combined algorithm would emit (commutative) duplicates.
Let us consider an example for algorithm EnumerateCmp. The underlying
graph is again the one shown in Fig. 3.12. Assume EnumerateCmp is called with
S1 = {R1 }. In the first statement, the set {R0 , R1 } is assigned to X. Then, the
neighborhood is calculated. This results in

N = {R0 , R4 } \ {R0 , R1 } = {R4 }.

Hence, {R4 } is emitted and together with {R1 }, it forms the csg-cmp-pair
({R1 }, {R4 }). Then, the recursive call to EnumerateCsgRec follows with ar-
guments G, {R4 }, and {R0 , R1 , R4 }. Subsequent EnumerateCsgRec generates
the connected sets {R2 , R4 }, {R3 , R4 }, and {R2 , R3 , R4 }, giving three more
csg-cmp-pairs.
3.2. DETERMINISTIC ALGORITHMS 79

3.2.5 Memoization
Whereas dynamic programming constructs the join trees iteratively from small
trees to larger trees, i.e. works bottom up, memoization works recursively. For a
given set of relations S, it produces the best join tree for S by recursively calling
itself for every subset S1 of S and considering all join trees between S1 and its
complement S2 . The best alternative is memoized (hence the name). The rea-
son is that two (even different) (sub-) sets of all relations may very well have the
common subsets. For example, {R1 , R2 , R3 , R4 , R5 } and {R2 , R3 , R4 , R5 , R6 }
have the common subset {R2 , R3 , R4 , R5 }. In order to avoid duplicate work,
memoization is essential.
In the following variant of memoization, we explore the search space of all
bushy trees and consider cross products. We split the functionality across two EX
functions. The first one initializes the BestTree data structure with single
relation join trees for Ri and then calls the second one. The second one is the
core memoization procedure which calls itself recursively.

MemoizationJoinOrdering(R)
Input: a set of relations R
Output: an optimal join tree for R
for (i = 1; i <= n; ++i) {
BestTree({Ri }) = Ri ;
}
return MemoizationJoinOrderingSub(R);

MemoizationJoinOrderingSub(S)
Input: a (sub-) set of relations S
Output: an optimal join tree for S
if(NULL == BestTree(S)) {
for all S1 ⊂ S do {
S2 = S \ S1 ;
CurrTree = CreateJoinTree(MemoizationJoinOrderingSub(S1 ), MemoizationJoinOrderingSub(
if (BestTree(S) == NULL || cost(BestTree(S)) > cost(CurrTree)) {
BestTree(S) = CurrTree;
}
}
}
return BestTree(S);

Again, pruning techniques can help to speed up plan generation [787]. ToDo?

3.2.6 Join Ordering by Generating Permutations

For any set of cost functions, we can directly generate permutations. Gen-
erating all permutations is clearly too expensive for more than a couple of
relations. However, we can safely neglect some of them. Consider the join
80 CHAPTER 3. JOIN ORDERING

sequence R1 R2 R3 R4 . If we know that R1 R3 R2 is cheaper than R1 R2 R3 , we

do not have to consider R1 R2 R3 R4 . The idea of the following algorithm is to
construct permutations by successively adding relations. Thereby, an extended
sequence is only explored if exchanging the last two relations does not result in
a cheaper sequence.

ConstructPermutations(Query Specification)
Input: query specification for relations {R1 , . . . , Rn }
Output: optimal left-deep tree
BestPermutation = NULL;
Prefix = ;
Rest = {R1 , . . . , Rn };
ConstructPermutationsSub(Prefix, Rest);
return BestPermutation

ConstructPermutationsSub(Prefix, Rest)
Input: a prefix of a permutation and the relations to be added (Rest)
Ouput: none, side-effect on BestPermutation
if (Rest == ∅) {
if (BestPermutation == NULL || cost(Prefix) < cost(BestPermutation)) {
BestPermutation = Prefix;
}
return
}
foreach (Ri , Rj ∈ Rest) {
if (cost(Prefix ◦ hRi , Rj i) ≤ cost(Prefix ◦ hRj , Ri i)) {
ConstructPermutationsSub(Prefix ◦ hRi i, Rest \ {Ri });
}
if (cost(Prefix ◦ hRj , Ri i) ≤ cost(Prefix ◦ hRi , Rj i)) {
ConstructPermutationsSub(Prefix ◦ hRj i, Rest \ {Rj });
}
}
return

The algorithm can be made more efficient, if the foreach loop considers only
a single relation and performs the swap test with this relation and the last
relation occurring in Prefix.
The algorithm has two main advantages over dynamic programming and
memoization. The first advantage is that it needs only linear space opposed
to exponential space for the two mentioned alternatives. The other main
advantage over dynamic programming is that it generates join trees early,
whereas with dynamic programming we only generate a plan after the whole
search space has been explored. Thus, if the query contains too many joins—
that is, the search space cannot be fully explored in reasonable time and
space—dynamic programming will not generate any plan at all. If stopped,
3.2. DETERMINISTIC ALGORITHMS 81

ConstructPermutations will not necessarily compute the best plan, but still
some plans have been investigated. This allows us to stop it after some time
limit has exceeded. The time limit itself can be fixed, like 100 ms, or variable,
like 5% of the execution time of the best plan found so far.
The predicates in the if statement can be made more efficient if a (local)
ranking function is available. Further speed-up of the algorithm can be achieved
if additionally the idea of memoization is applied (of course, this jeopardizes
the small memory footprint).
The following variant might be interesting if one is willing to go from linear
space consumption to quadratic space consumption. The original algorithm
is then started n times, once for each relation as a starting relation. The n
different instantiations then have to run interleaved. This variant reduces the
dependency on the starting relation.
Worst Case Analysis ToDo/EX
Pruning/memoization/propagation ToDo/EX

3.2.7 A Dynamic Programming based Heuristics for Chain Queries

In Section 3.1.6, we saw that the complexity of producing optimal left-deep
trees possibly containing cross products for chain queries is an open problem.
However, the case does not seem to be hopeless. In fact, Scheufele and Mo-
erkotte present two algorithms [754, 756] for this problem. For one algorithm,
it can be proven that it has polynomial runtime, for the other, it can be proven
that it produces the optimal join tree. However, for none of them both could
be proven so far.

Basic Definitions and Lemmata

An instance of the join-ordering problem for chain queries (or a chain query
for short) is fully described by the following parameters. First, n relations
R1 , . . . , Rn are given. The size of relation Ri (1 ≤ i ≤ n) is denoted by |Ri | or
nRi . Second, the query graph G on the set of relations R1 , . . . , Rn must be a
chain. That is, its edges are {(Ri , Ri+1 ) | 1 ≤ i < n}:

R1 — R2 — . . . — Rn

For every edge (Ri , Ri+1 ), there is an associated selectivity fi,i+1 = |Ri B
Ri+1 |/|Ri × Ri+1 |. We define all other selectivities fi,j = 1 for |i − j| =6 1.
They correspond to cross products.
In this section we consider only left-deep processing trees. However, we
allow them to contain cross products. Hence, any permutation is a valid join
tree. There is a unique correspondence not only between left-deep join trees
but also between consecutive parts of a permutation and segments of a left-
deep tree. Furthermore, if a segment of a left-deep tree does not contain cross
products, it uniquely corresponds to a consecutive part of the chain in the query
graph. In this case, we also speak of (sub)chains or connected (sub)sequences.
We say that two relations Ri and Rj are connected if they are adjacent in G;
more generally, two sequences s and t are connected if there exist relations Ri
82 CHAPTER 3. JOIN ORDERING

in s and Rj in t such that Ri and Rj are connected. A sequence of relations s

is connected if for all subsequences s1 and s2 satisfying s = s1 s2 it holds that
s1 is connected to s2 .
Given a chain query, we ask for a permutation s = r1 . . . rn of the n relations
(i.e. there is a permutation π such that ri = Rπ(i) for 1 ≤ i ≤ n) that produces
minimal costs under the cost function Cout .
Remember that the dynamic programming approach considers n2n−1 −n(n+
1)/2 alternatives for left-deep processing trees with cross products—independently
of the query graph and the cost function. The question arises whether it is pos-
sible to lower the complexity in case of simple chain queries.
The IKKBZ algorithm solves the join ordering problem for tree queries by
decomposing the problem into polynomially many subproblems which are sub-
ject to tree-like precedence constraints. The precedence constraints ensure that
the cost functions of the subproblems now have the ASI property. The remain-
ing problem is to optimize the constrained subproblems under the simpler cost
function. Unfortunately, this approach does not work in our case, since no such
decomposition seems to exist.
Let us introduce some notions used for the algorithms. We have to gener-
alize the rank used in the IKKBZ algorithm to relativized ranks. We start by
relativizing the cost function. The costs of a sequence s relative to a sequence
u are defined as

Cu () := 0
Cu (Ri ) := 0 if u =
Y
Cu (Ri ) := ( fj,i )ni if u 6=
Rj <uRi Ri

Cu (s1 s2 ) := Cu (s1 ) + Tu (s1 ) ∗ Cus1 (s2 )

with

Tu () := 1
Y Y
Tu (s) := ( fj,i ) ∗ ni
Ri ∈s Rj <us Ri

Here, Ri <s Rj is true if and only if Ri appears before Rj in s. As usual, empty

products evaluate to 1. Several things should be noted. First, Cus (t) = Cu (t)
holds if there is no connection between relations in s and t. Second, T (Ri ) =
|Ri | and T (s) = |s|. That is, Tu generalizes the size of a single relation or of a
sequence of relations. Third, note that Cu () = 0 for all u but C (s) = 0 only
if s does not contain more than one relation. The special case that C (R) = 0
for a single relation R causes some problems in the homogeneity of definitions
and proofs. Hence, we abandon this case from all definitions and lemmata of
this section. This will not be repeated in every definition and lemma, but will
implicitly be assumed. Further, the two algorithms will be presented in two
versions. The first version is simpler and relies on a modified cost function C 0 ,
3.2. DETERMINISTIC ALGORITHMS 83

and only the second version will apply to the original cost function C. As we
will see, C 0 differs from C in exactly the problematic case in which it is defined
as Cu0 (Ri ) := |Ri |. Now, C0 (s) = 0 holds if and only if s = holds. Within
subsequent definitions and lemmata, C can also be replaced by C 0 without
changing their validity. Last, we abbreviate C by C for convenience.

Example 1: Consider a chain query involving the relations R1 , R2 , R3 . The

parameters are |R1 | = 1, |R2 | = 100, |R3 | = 10 and f1,2 = f2,3 = 0.9. The
expected size of the query result is independent of the ordering of the relations.
Hence, we have
T (R1 R2 R3 ) = · · · = T (R3 R2 R1 ) = 100 ∗ 10 ∗ 1 ∗ .9 ∗ .9 = 810.
There are 6 possible orderings of the relations with the following costs:
C(R1 R2 R3 ) = 1 ∗ 100 ∗ 0.9 + 1 ∗ 100 ∗ 10 ∗ 0.9 ∗ 0.9 = 900
C(R1 R3 R2 ) = 1 ∗ 10 + 1 ∗ 10 ∗ 100 ∗ 0.9 ∗ 0.9 = 820
C(R2 R3 R1 ) = 100 ∗ 10 ∗ 0.9 + 100 ∗ 10 ∗ 1 ∗ 0.9 ∗ 0.9 = 1710
C(R2 R1 R3 ) = C(R1 R2 R3 )
C(R3 R1 R2 ) = C(R1 R3 R2 )
C(R3 R2 R1 ) = C(R2 R3 R1 )
Note that the cost function is invariant with respect to the order of the first two
relations. The minimum over all costs is 820, and the corresponding optimal
join ordering is R1 R3 R2 .
2
Using the relativized cost function, we can define the relativized rank.
Definition 3.2.8 (rank) The rank of a sequence s relative to a non-empty
sequence u is given by
Tu (s) − 1
ranku (s) :=
Cu (s)
In the special case that s consists of a single relation Ri , the intuition behind
the rank function becomes transparent. Let fi be the product of the selectivities
between relations in u and Ri . Then ranku (Ri ) = fif|R i |−1
i |Ri |
. Hence, the rank
x−1
becomes a function of the form f (x) = x . This function is monotonously
increasing in x for x > 0. The argument to the function f (x) is (for the
computation of the size of a single relation Ri ) fi |Ri |. But this is the factor by
which the next intermediate result will increase (or decrease). Since we sum up
intermediate results, this is an essential number. Furthermore, it follows from
the monotonicity of f (x) that ranku (Ri ) ≤ ranku (Rj ) if and only if fi |Ri | ≤
fj |Rj | where fj is the product of all selectivities between Rj and relations in u.

Example 1 (cont’d): Supposing the query given in Example 1, the optimal

sequence R1 R3 R2 gives rise to the following ranks.
TR1 (R2 )−1 100∗0.9−1
rankR1 (R2 ) = CR1 (R2 ) = 100∗0.9 ≈ 0.9888
TR1 (R3 )−1
rankR1 (R3 ) = C R1 (R3 )
= 10∗1.0−1
10∗1.0 = 0.9
TR1 R3 (R2 )−1
rankR1 R3 (R2 ) = CR R (R2 ) = 100∗0.9∗0.9−1
100∗0.9∗0.9 ≈ 0.9877
1 3
84 CHAPTER 3. JOIN ORDERING

Hence, within the optimal sequence, the relation with the smallest rank (here
R3 , since rankR1 (R3 ) < rankR1 (R2 )) is preferred. As the next lemma will
show, this is no accident.
2
Using the rank function, the following lemma can be proved.

Lemma 3.2.9 For sequences

S = r1 · · · rk−1 rk rk+1 rk+2 · · · rn

S 0 = r1 · · · rk−1 rk+1 rk rk+2 · · · rn

the following holds:

C(S) ≤ C(S 0 ) ⇔ ranku (rk ) ≤ ranku (rk+1 )

where u = r1 · · · rk−1 . Equality only holds if it holds on both sides.

Example 1 (cont’d): Since the ranks of the relations in Example 1 are

ordered with ascending ranks, Lemma 3.2.9 states that, whenever we exchange
two adjacent relations, the costs cannot decrease. In fact, we observe that
C(R1 R3 R2 ) ≤ C(R1 R2 R3 ). 2
An analogous lemma still holds for two unconnected subchains:

Lemma 3.2.10 Let u, x and y be three subchains where x and y are not inter-
connected. Then we have:

C(uxy) ≤ C(uyx) ⇔ ranku (x) ≤ ranku (y)

Equality only holds if it holds on both sides.

Next, we define the notion of a contradictory chain, which will be essential

to the algorithms. The subsequent lemmata will allow us to cut down the search
space to be explored by any optimization algorithm.

Definition 3.2.11 (contradictory pair of subchains) Let u, x, y be nonemp-

ty sequences. We call (x, y) a contradictory pair of subchains if and only if

Cu (xy) ≤ Cu (yx) ∧ ranku (x) > rankux (y)

A special case occurs when x and y are single relations. Then the above condi-
tion simplifies to
rankux (y) < ranku (x) ≤ ranku (y)

To explain the intuition behind the definition of contradictory subchains, we

need another example.
3.2. DETERMINISTIC ALGORITHMS 85

Example 2: Suppose a chain query involving R1 , R2 , R3 is given. The relation

sizes are |R1 | = 1, |R2 | = |R3 | = 10 and the selectivities are f1,2 = 0.5, f2,3 =
0.2. Consider the sequences R1 R2 R3 and R1 R3 R2 , which differ in the order of
the last two relations. We have

rankR1 (R2 ) = 0.8

rankR1 R2 (R3 ) = 0.0
rankR1 (R3 ) = 0.9
rankR1 R3 (R2 ) = 0.5

and

C(R1 R2 R3 ) = 15
C(R1 R3 R2 ) = 20

Hence,

rankR1 (R2 ) > rankR1 R2 (R3 )

rankR1 (R3 ) > rankR1 R3 (R2 )
C(R1 R2 R3 ) < C(R1 R3 R2 )

and (R2 , R3 ) is a contradictory pair within R1 R2 R3 . Now the use of the term
contradictory becomes clear: the costs do not behave as could be expected from
the ranks. 2
The next (obvious) lemma states that contradictory chains are necessarily
connected.
Lemma 3.2.12 If there is no connection between two subchains x and y, then
they cannot build a contradictory pair (x, y).
Now we present the fact that between a contradictory pair of relations, there
cannot be any other relation not connected to them without increasing cost.
Lemma 3.2.13 Let S = usvtw be a sequence. If there is no connection between
relations in s and v and relations in v and t, and ranku (s) ≥ rankus (t), then
there exists a sequence S 0 not having higher costs, where s immediately precedes
t.

Example 3: Consider five relations R1 , . . . , R5 . The relation sizes are |R1 | =

1, |R2 | = |R3 | = |R4 | = 8, and |R5 | = 2. The selectivities are f1,2 = 12 ,
f2,3 = 14 , f3,4 = 81 , and f4,5 = 12 . Relation R5 is not connected to relations R2
and R3 . Further, within the sequence R1 R2 R5 R3 R4 relations R2 and R3 have
contradictory ranks: rankR1 (R2 ) = 4−1 3 2−1
4 = 4 and rankR1 R2 R5 (R3 ) = 2 = 2 .
1

Hence, at least one of R1 R5 R2 R3 R4 and R1 R2 R3 R5 R4 must be of no greater

cost than R1 R2 R5 R3 R4 . This is indeed the case:

C(R1 R2 R3 R5 R4 ) = 4 + 8 + 16 + 8 = 36
C(R1 R2 R5 R3 R4 ) = 4 + 8 + 16 + 8 = 36
C(R1 R5 R2 R3 R4 ) = 2 + 8 + 16 + 8 = 34
86 CHAPTER 3. JOIN ORDERING

2
The next lemma shows that, if there exist two sequences of single rank-sorted
relations, then their costs as well as their ranks are necessarily equal.
Lemma 3.2.14 Let S = x1 · · · xn and S 0 = y1 · · · yn be two different rank-
sorted chains containing exactly the relations R1 , . . . , Rn , i.e.
rankx1 ···xi−1 (xi ) ≤ rankx1 ···xi (xi+1 ) for all 1 ≤ i ≤ n,
ranky1 ···yi−1 (yi ) ≤ ranky1 ···yi (yi+1 ) for all 1 ≤ i ≤ n,
then S and S 0 have equal costs and, furthermore,
rankx1 ···xi−1 (xi ) = ranky1 ···yi−1 (yi ) for all 1 < i ≤ n
One could conjecture that the following generalization of Lemma 3.2.14 is
true, although no one has proved it so far.
Conjecture 3.2.1 Let S = x1 · · · xn and S 0 = y1 · · · ym be two different rank-
sorted chains for the relations R1 . . . , Rn where the x0i s and yi0 s are subsequences
such that
rankx1 ···xi−1 (xi ) ≤ rankx1 ···xi (xi+1 ) for all 1 ≤ i < n,
ranky1 ···yi−1 (yi ) ≤ ranky1 ···yi (yi+1 ) for all 1 ≤ i < m,
and the subsequences xi and yj are all optimal (with respect to the fixed prefixes
x1 . . . xi−1 and y1 . . . yj−1 ), then S and S 0 have equal costs.
Consider the problem of merging two optimal unconnected chains. If we
knew that the ranks of relations in an optimal chain are always sorted in as-
cending order, we could use the classical merge procedure to combine the two
chains. The resulting chain would also be rank-sorted in ascending order and,
according to Lemma 3.2.14, it would be optimal. Unfortunately, this does not
work, since there are optimal chains whose ranks are not sorted in ascending
order: those containing sequences with contradictory ranks.
Now, as shown in Lemma 3.2.13, between contradictory pairs of relations
there cannot be any other relation not connected to them. Hence, in the merging
process, we have to take care that we do not merge a contradictory pair of
relations with a relation not connected to the pair. In order to achieve this,
we apply the same trick as in the IKKBZ algorithm: we tie the relations of a
contradictory subchain together by building a compound relation. Assume that
we tie together the relations r1 , . . . , rn to a new relation r1,...,n . Then we define
the size of r1,...,n as |r1,...,n | = |r1 B . . . B rn | Further, if some ri (1 ≤ i ≤ n)
does have a connection to some rk 6∈ {r1 , . . . , rn } then we define the selectivity
factor fr1,...,n ,rk between rk and r1,...,n as fr1,...,n ,rk = fi,k .
If we tie together contradictory pairs, the resulting chain of compound re-
lations still does not have to be rank-sorted with respect to the compound
relations. To overcome this, we iterate the process of tying contradictory pairs
of compound relations together until the sequence of compound relations is
rank-sorted, which will eventually be the case. That is, we apply the normal-
ization as used in the IKKBZ algorithm. However, we have to reformulate it
for relativized costs and ranks:
3.2. DETERMINISTIC ALGORITHMS 87

Normalize(p,s)
while (there exist subsequences u, v (u 6= ) and
compound relations x, y such that s = uxyv
and Cpu (xy) ≤ Cpu (yx)
and rankpu (x) > rankpux (y)) {
replace xy by a compound relation (x, y);
}
return (p, s);

The compound relations in the result of the procedure Normalize are called
contradictory chains. A maximal contradictory subchain is a contradictory sub-
chain that cannot be made longer by further tying steps. Resolving the tyings
introduced in the procedure normalize is called de-normalization. It works the
same way as in the IKKBZ algorithm. The cost, size and rank functions can
now be extended to sequences containing compound relations in a straightfor-
ward way. We define the cost of a sequence containing compound relations to
be identical with the cost of the corresponding de-normalized sequence. The
size and rank functions are defined analogously.
The following simple observation is central to the algorithms: every chain
can be decomposed into a sequence of adjacent maximal contradictory sub-
chains. For convenience, we often speak of chains instead of subchains and of
contradictory chains instead of maximal contradictory subchains. The mean-
ing should be clear from the context. Further, we note that the decomposi-
tion into adjacent maximal contradictory subchains is not unique. For exam-
ple, consider an optimal subchain r1 r2 r3 and a sequence u of preceding rela-
tions. If ranku (r1 ) > rankur1 (r2 ) > rankur1 r2 (r3 ) one can easily show that
both (r1 , (r2 , r3 )) and ((r1 , r2 ), r3 ) are contradictory subchains. Nevertheless,
this ambiguity is not important since in the following we are only interest-
ed in contradictory subchains which are optimal . In this case, the condition
Cu (xy) ≤ Cu (yx) is certainly true and can therefore be neglected. One can
show that for the case of optimal subchains the indeterministically defined nor-
malization process is well-defined, that is, if S is optimal, normalize(P,S) will
always terminate with a unique “flat” decomposition of S into maximal contra-
dictory subchains (flat means that we remove all but the outermost parenthesis,
e.g. (R1 R2 )(((R5 R4 )R3 )R6 ) becomes (R1 R2 )(R5 R4 R3 R6 )).
The next two lemmata and the conjecture show a possible way to overcome
the problem that if we consider cross products, we have an unconstrained or-
dering problem and the idea of Monma and Sidney as exploited in the IKKBZ
algorithm is no longer applicable. The next lemma is a direct consequence of
the normalization procedure.

Lemma 3.2.15 Let S = s1 . . . sm be an optimal chain consisting of the maxi-

mal contradictory subchains s1 , . . . , sm (as determined by the function normalize).
Then
rank(s1 ) ≤ ranks1 (s2 ) ≤ ranks1 s2 (s3 )
88 CHAPTER 3. JOIN ORDERING

≤ · · · ≤ ranks1 ...sm−1 (sm ),

in other words, the (maximal) contradictory subchains in an optimal chain are
always sorted by ascending ranks.

The next result shows how to build an optimal sequence from two optimal
non-interconnected sequences.

Lemma 3.2.16 Let x and y be two optimal sequences of relations where x and
y are not interconnected. Then the sequence obtained by merging the maximal
contradictory subchains in x and y (as obtained by normalize) according to
their ascending rank is optimal.

Merging two sequences in the way described in Lemma 3.2.16 is a funda-

mental process. We henceforth refer to it by simply saying that we merge by
the ranks.
We strongly conjecture that the following generalization of Lemma 3.2.14
is true, although it is yet unproven. It uses the notion of optimal recursive
decomposable subchains defined in the next subsection.

Conjecture 3.2.2 Consider two sequences S and T containing exactly the re-
lations R1 ,. . . ,Rn . Let S = s1 . . . sk and T = t1 . . . tl be such that each of the
maximal contradictory subchains si , i = 1, . . . , k and tj , j = 1, . . . , l are optimal
recursively decomposable. Then S and T have equal costs.

The first algorithm

We first use a slightly modified cost function C 0 , which additionally respects
the size of the first relation in the sequence, i.e. C and C 0 relate via

0 C(s) + |nR |, if u = and s = Rs0
Cu (s) =
Cu (s), otherwise
This cost function can be treated in a more elegant way than C. The new rank
function is now defined as ranku (s) := (Tu (s) − 1)/Cu0 (s). Note that the rank
function is now defined even if u = and s is a single relation. The size function
remains unchanged. At the end of this subsection, we describe how our results
can be adapted to the original cost function C.
The rank of a contradictory chain depends on the relative position of the
relations that are directly connected to it. For example, the rank of the con-
tradictory subchain (R5 R3 R4 R2 ) depends on the position of the neighbouring
relations R1 and R6 relative to (R5 R3 R4 R2 ). That is, whether they appear be-
fore or after the sequence (R5 R3 R4 R2 ). Therefore, we introduce the following
fundamental definitions:

Definition 3.2.17 (neighbourhood) We call the set of relations that are di-
rectly connected to a subchain (with respect to the query graph G) the complete
neighbourhood of that subchain. A neighbourhood is a subset of the complete
neighbourhood. The complement of a neighbourhood u of a subchain s is defined
as v \ u, where v denotes the complete neighbourhood of s.
3.2. DETERMINISTIC ALGORITHMS 89

Note that the neighbourhood of a subchain s within a larger chain us is unique-

ly determined by the subsequence u of relations preceding it. For convenience,
we will often use sequences of preceding relations to specify neighbourhoods.
We henceforth denote a pair consisting of a connected sequence s and a neigh-
bourhood u by [s]u .

Definition 3.2.18 (contradictory subchain, extent) A contradictory sub-

chain [s]u is inductively defined as follows.

1. For a single relation s, [s]u is a contradictory subchain.

2. There is a decomposition s = vw such that (v, w) is a contradictory pair

with respect to the preceding subsequence u and both [v]u and [w]uv are
contradictory subchains themselves.

The extent of a contradictory chain [s]u is defined as the pair consisting of

the neighbourhood u and the set of relations occurring in s. Since contradicto-
ry subchains are connected, the set of occurring relations has always the form
{Ri , Ri+1 , . . . , Ri+l } for some 1 ≤ i ≤ n, 0 ≤ l ≤ n − i. An optimal contra-
dictory subchain to a given extent is a contradictory subchain with lowest cost
among all contradictory subchains of the same extent.

The number of different extents of contradictory subchains for a chain query

of n relations is 2n2 − 2n + 1. Each contradictory chain can be completely
recursively decomposed into adjacent pairs of connected subchains. Subchains
with this property are defined next (similar types of decompositions occur in
[429, 788]).

Definition 3.2.19 ((optimal) recursively decomposable subchain) A re-

cursively decomposable subchain [s]u is inductively defined as follows.

1. If s is a single relation, then [s]u is recursively decomposable.

2. There is a decomposition s = vw such that v is connected to w and both

[v]u and [w]uv are recursively decomposable subchains.

The extent of a recursively decomposable chain is defined in the same way as

for contradictory chains. Note that every contradictory subchain is recursively
decomposable. Consequently, the set of all contradictory subchains for a certain
extent is a subset of all recursively decomposable subchains of the same extent.

Example 4: Consider the sequence of relations

s = R2 R4 R3 R6 R5 R1 .

Using parentheses to indicate the recursive decompositions, we have the follow-

ing two possibilities
(((R2 (R4 R3 ))(R6 R5 ))R1 )
((R2 ((R4 R3 )(R6 R5 )))R1 )
90 CHAPTER 3. JOIN ORDERING

The extent of the recursively decomposable subchain

R4 R3 R6 R5 of s is ({R2 }, {R3 , R4 , R5 , R6 }). 2
The number of different recursively decomposable chains involving the rela-
tions R1 , . . . , Rn is rn , where rn denotes the n-th Schröder number [788]. Hence,
the
P totaln−2number
of recursively decomposable chains is rn + 2(n − 1)rn−1 +
4 n−2
i=1 i ri . It can be shown that
√
C(2 + 8)n
rn ≈
n3/2
q √
where C = 1/2 2 π2−4 . Using Stirling’s formula for n! it is easy to show
that limn→∞ rn!n = 0. Thus, the probability of a random permutation to be
recursively decomposable strives to zero for large n.
An optimal recursively decomposable subchain to a given extent is a recur-
sively decomposable subchain with lowest cost among all recursively decompos-
able subchains of the same extent. There is an obvious dynamic programming
algorithm to compute optimal recursive decomposable subchains. It is not hard
to see that Bellman’s optimality principle [597, 205] holds and every optimal
recursively decomposable subchain can be decomposed into smaller optimal
recursively decomposable subchains.

Example 5: In order to compute an optimal recursively decomposable sub-

chain for the extent
({R2 , R7 }, {R3 , R4 , R5 , R6 })
the algorithm makes use of optimal recursively decomposable subchains for the
extents

({R2 }, {R3 }) ({R7 , R3 }, {R4 , R5 , R6 })

({R2 }, {R3 , R4 }) ({R7 , R4 }, {R5 , R6 })
({R2 }, {R3 , R4 , R5 }) ({R5 , R7 }, {R6 })
({R7 }, {R4 , R5 , R6 }) ({R2 , R4 }, {R3 })
({R7 }, {R5 , R6 ) ({R2 , R5 }, {R3 , R4 })
({R7 }, {R6 }) ({R2 , R6 }, {R3 , R4 , R5 })

which have been computed in earlier steps2 . A similar dynamic programming

algorithm can be used to determine optimal contradictory subchains. 2
Let E be the set of all possible extents. We define the following partial order
P = (E, ≺) on E. For all extents e1 , e2 ∈ E, we have e1 ≺ e2 if and only if
e1 can be obtained by splitting the extent e2 . For example, ({R7 }, {R5 , R6 }) ≺
({R2 , R7 }, {R3 , R4 , R5 , R6 }). The set of maximal extents M then corresponds
to a set of incomparable elements (antichain) in P such that for all extents e
enumerated so far, there is an extent e0 ∈ M with e ≺ e0 .
Now, since every optimal join sequence has a representation as a sequence of
contradictory subchains, we only have to determine this representation. Con-
sider a contradictory subchain c in an optimal join sequence s. What can we say
2
The splitting of extents induces a partial order on the set of extents.
3.2. DETERMINISTIC ALGORITHMS 91

about c? Obviously, c has to be optimal with respect to the neighbourhood de-

fined by the relations preceding c in s. Unfortunately, identifying contradictory
subchains that are optimal sequences seems to be as hard as the whole problem
of optimizing chain queries. Therefore, we content ourselves with the following
weaker condition which may lead to multiple representations. Nevertheless, it
seems to be the strongest condition for which all subchains satisfying the con-
dition can be computed in polynomial time. The condition says that s should
be optimal both with respect to all contradictory chains of the same extent as s
and with respect to all recursively decomposable subchains of the same extent.
So far it is not clear whether these conditions lead to multiple representations.
Therefore, we have no choice but to enumerate all possible representations and
select the one with minimal costs. Next we describe the first algorithm.
Algorithm Chain-I’:
1. Use dynamic programming to determine all optimal contradictory sub-
chains.
This step can be made faster by keeping track of the set M of all maximal
extents (with respect to the partial order induced by splitting extents).
2. Determine all optimal recursively decomposable subchains for all extents
included in some maximal extent in M .
3. Compare the results from steps 1 and 2 and retain only matching sub-
chains.
4. Sort the contradictory subchains according to their ranks.
5. Eliminate contradictory subchains that cannot be part of a solution.
6. Use backtracking to enumerate all sequences of rank-ordered optimal con-
tradictory subchains and keep track of the sequence with lowest cost.
In step 5 of the algorithm, we eliminate contradictory subchains that do not
contribute to a solution. Note that the contradictory subchains in an optimal
sequence are characterized by the following two conditions.
1. The extents of all contradictory subchains in the representation build a
partition of the set of all relations.

2. The neighbourhoods of all contradictory subchains are consistent with the

relations occurring at earlier and later positions in the sequence.
Note that any contradictory subchain occurring in the optimal sequence (except
at the first and last positions) necessarily has matching contradictory subchains
preceding and succeeding it in the list. In fact, every contradictory subchain X
occurring in the optimal join sequence must satisfy the following two conditions.
1. For every relation R in the neighbourhood of X, there exists a contra-
dictory subchain Y at an earlier position in the list which itself meets
condition 1, such that R occurs in Y , and Y can be followed by X.

2. For every relation R in the complementary neighbourhood of X, there

exists a contradictory subchain Y at a later position in the list which
itself meets condition 2, such that R occurs in the neighbourhood of Y ,
and X can be followed by Y .
92 CHAPTER 3. JOIN ORDERING

Using these two conditions, we can eliminate “useless” contradictory chains

from the rank-ordered list by performing a reachability algorithm for each of
the DAGs defined by the conditions 1 and 2. In the last step of our algorithm,
backtracking is used to enumerate all representations. Suppose that at some
step of the algorithm we have determined an initial sequence of contradictory
subchains and have a rank-sorted list of the remaining possible contradictory
subchains. In addition to the two conditions mentioned above, another reacha-
bility algorithm can be applied to determine the set of reachable relations from
the list (with respect to the given prefix). With the use of this information, all
branches that do not lead to a complete join sequence can be pruned.
Let us analyze the worst case time complexity of the algorithm. The two
dynamic programming steps both iterate over O(n2 ) different extents, and each
extent gives rise to O(n) splittings. Moreover, for each extent one normalization
is necessary, which requires linear time (cost, size and rank can be computed in
constant time using recurrences). Therefore, the complexity of the two dynamic
programming steps is O(n4 ). Sorting O(n2 ) contradictory chains can be done
in time O(n2 log n). The step where all “useless” contradictory subchains are
eliminated, consists of two stages of a reachability algorithm which has com-
plexity O(n4 ). If conjecture 3.2.2 is true, the backtracking step requires linear
time, and the total complexity of the algorithm is O(n4 ). Otherwise, if con-
jecture 3.2.2 is false, the algorithm might exhibit exponential worst case time
complexity.
We now describe how to reduce the problem for our original cost function
C to the problem for the modified cost function C 0 . One difficulty with the
original cost function is that the ranks are defined only for subsequences of at
least two relations. Hence, for determining the first relation in our solution
we do not have sufficient information. An obvious solution to this problem
is to try every relation as starting relation, process each of the two resulting
chain queries separately and choose the chain with minimum costs. The new
complexity will increase by about a factor of n. This first approach is not
very efficient, since the dynamic programming computations overlap consider-
ably, e.g. if we perform dynamic programming on the two overlapping chains
R1 R2 R3 R4 R5 R6 and R2 R3 R4 R5 R6 R7 , for the intersecting chain R2 R3 R4 R5 R6
everything is computed twice. The cue is that we can perform the dynamic pro-
gramming calculations before we consider a particular starting relation. Hence,
the final algorithm can be sketched as follows:
Algorithm CHAIN-I:
1. Compute all optimal contradictory chains by dynamic programming (cor-
responds to the steps 1-4 of Algorithm I’)

2. For each starting relation Ri , perform the following steps:

(a) Let L1 be the result of applying steps 5 and 6 of Algorithm I’ to all

contradictory subchains whose extent (N, M ) satisfies Ri ∈ N and
M ⊆ {R1 , . . . , Ri }.
(b) Let L2 be the result of applying steps 5 and 6 of Algorithm I’ to all
contradictory subchains whose extent (N, M ) satisfies Ri ∈ N and
3.2. DETERMINISTIC ALGORITHMS 93

M ⊆ {Ri , . . . , Rn }.
(c) For all (l1 , l2 ) ∈ L1 × L2 , perform the following steps:
i. Let L be the result of merging l1 and l2 according to their ranks.
ii. Use Ri L to update the current-best join ordering.

Suppose that conjecture 3.2.2 is true, and we can replace the backtracking part
by a search for the first solution. Then the complexity
Pn of the step 1 is O(n4 ),
whereas the complexity of step 2 amounts to i=1 (O(i ) + O(n − i)2 + O(n)) =
2

O(n3 ). Hence, the total complexity would be O(n4 ) in the worst case. Of
course, if our conjecture is false, the necessary backtracking step might lead to
an exponential worst case complexity.

The second algorithm

The second algorithm is much simpler than the first one but proves to be less
efficient in practice. Since the new algorithm is very similar to some parts of
the old one, we just point out the differences between both algorithms. The
new version of the algorithm works as follows.
Algorithm CHAIN-II’:

1. Use dynamic programming to compute an optimal recursive decomposable

chain for the whole set of relations {R1 , . . . , Rn }.

2. Normalize the resulting chain.

3. Reorder the contradictory subchains according to their ranks.

4. De-normalize the sequence.

Step 1 is identical to step 2 of our first algorithm. Note that Lemma 3.2.15
cannot be applied to the sequence in Step 2, since an optimal recursive de-
composable chain is not necessarily an optimal chain. Therefore, the question
arises whether Step 3 really makes sense. One can show that the partial order
defined by the precedence relation among the contradictory subchains has the
property that all elements along paths in the partial order are sorted by rank.
By computing a greedy topological ordering (greedy with respect to the ranks),
we obtain a sequence as requested in step 3.
Let us briefly analyze the worst case time complexity of the second algo-
rithm. The first step requires time O(n4 ), whereas the second step requires time
O(n2 ). The third step has complexity O(n log n). Hence, the total complexity
is O(n4 ).
Algorithm II’ is based on the cost function C 0 . We can now modify the
algorithm for the original cost function C as follows.
Algorithm CHAIN-II:

1. Compute all optimal recursive decomposable chains by dynamic program-

ming (corresponds to step 1 of Algorithm II’)

2. For each starting relation Ri , perform the following steps:

94 CHAPTER 3. JOIN ORDERING

(a) Let L1 be the result of applying the steps 2 and 3 of Algorithm II’ to
all optimal recursive decomposable subchains whose extent (N, M )
satisfies Ri ∈ N and M ⊆ {R1 , . . . , Ri }.
(b) Let L2 be the result of applying the steps 2 and 3 of Algorithm II’ to
all optimal recursive decomposable subchains whose extent (N, M )
satisfies Ri N and M ⊆ {Ri , . . . , Rn }.
(c) Let L be the result of merging L1 and L2 according to their ranks.
(d) De-normalize L.
(e) Use Ri L to update the current-best join ordering.

PThe complexity of Step 1 is O(n4 ), whereas the complexity of Step 2 amounts

to ni=1 (O(i2 ) + O(n − i)2 + O(n)) = O(n3 ). Hence, the time complexity of
Algorithm II is O(n4 ).
Summarizing, we are now left with one algorithm that produces the optimal
result but whose worst-case runtime behavior is unknown and one algorithm
with polynomial runtime but producing a result which has not been proven to
be optimal. Due to this lack of hard facts, Moerkotte and Scheufele ran about
700,000 experiments with random queries of sizes up to 30 relations and fewer
experiments for random queries with up to 300 relations to compare the results
of our algorithms. For n ≤ 15, they additionally compared the results with a
standard dynamic programming algorithm. Their findings can be summarized
as follows.

• All algorithms yielded identical results.

• Backtracking always led to exactly one sequence of contradictory chains.

• In the overwhelming majority of cases the first algorithm proved to be

faster than the second.

Whereas the run time of the second algorithm is mainly determined by the
number of relations in the query, the run time of the first also heavily depends
on the number of existing optimal contradictory subchains. In the worst case,
the first algorithm is slightly inferior to the second. Additionally, Hamalainen
reports on an independent implementation of the second algorithm [388]. He
could not find an example where the second algorithm did not produce the
optimal result either. We encourage the reader to prove that it produces the
EX optimal result.

3.2.8 Transformation-Based Approaches

The idea of transformation-based algorithms can be described as follows. Start-
ing from an arbitrary join tree, equivalences (such as commutativity and asso-
ciativity) are applied to it to derive a set of new join trees. For each of the
join trees, the equivalences are again applied to derive even more join trees.
This procedure is repeated until no new join tree can be derived. This proce-
dure exhaustively enumerates the set of all bushy trees. Furthermore, before an
3.2. DETERMINISTIC ALGORITHMS 95

equivalence is applied, it is difficult to see whether the resulting join tree has al-
ready been produced or not (see also Figure 2.6). Thus, this procedure is highly
inefficient. Hence, it does not play any role in practice. Nevertheless, we give
the pseudo-code for it, since it forms the basis for several of the following algo-
rithms. We split the exhaustive transformation approach into two algorithms.
One that applies all equivalences to a given join tree (ApplyTransformations)
and another that does the loop (ExhaustiveTransformation). A transforma-
tion is applied in a directed way. Thus, we reformulate commutativity and
associativity as rewrite rules using ; to indicate the direction.
The following table summarizes all rules commonly used in transformation-
based and randomized join ordering algorithms. The first three are directly
derived from the commutativity and associativity laws for the join. The other
rules are shortcuts used under special circumstances. For example, left associa-
tivity may turn a left-deep tree into a bushy tree. When only left-deep trees are
to be considered, we need a replacement for left associativity. This replacement
is called left join exchange.

R1 B R2 ; R2 B R1 Commutativity
(R1 B R2 ) B R3 ; R1 B (R2 B R3 ) Right Associativity
R1 B (R2 B R3 ) ; (R1 B R2 ) B R3 Left Associativity
(R1 B R2 ) B R3 ; (R1 B R3 ) B R2 Left Join Exchange
R1 B (R2 B R3 ) ; R2 B (R1 B R3 ) Right Join Exchange

Two more rules are often used to transform left-deep trees. The first opera-
tion (swap) exchanges two arbitrary relations in a left-deep tree. The second
operation (3Cycle) performs a cyclic rotation of three arbitrary relations in a
left-deep tree. To account for different join methods, a rule called join method
exchange is introduced.
The first rule set (RS-0) we are using contains the commutativity rule and
both associativity rules. Applying associativity can lead to cross products. RS-0
If we do not want to consider cross products, we only apply any of the two
associativity rules if the resulting expression does not contain a cross product.
It is easy to extend ApplyTransformations to cover this by extending the if
conditions with
and (ConsiderCrossProducts || connected(·))
where the argument of connected is the result of applying a transformation.

ExhaustiveTransformation({R1 , . . . , Rn })
Input: a set of relations
Output: an optimal join tree
Let T be an arbitrary join tree for all relations
Done = ∅; // contains all trees processed
ToDo = {T }; // contains all trees to be processed
while (!empty(ToDo)) {
Let T be an arbitrary tree in ToDo
96 CHAPTER 3. JOIN ORDERING

ToDo \ = T ;
Done ∪ = T ;
Trees = ApplyTransformations(T );
for all T ∈ Trees do {
if (T 6∈ ToDo ∪ Done) {
ToDo + = T ;
}
}
}
return cheapest tree found in Done;

ApplyTransformations(T )
Input: join tree
Output: all trees derivable by associativity and commutativity
Trees = ∅;
Subtrees = all subtrees of T rooted at inner nodes
for all S ∈ Subtrees do {
if (S is of the form S1 B S2 ) {
Trees + = S2 B S1 ;
}
if (S is of the form (S1 B S2 ) B S3 ) {
Trees + = S1 B (S2 B S3 );
}
if (S is of the form S1 B (S2 B S3 )) {
Trees + = (S1 B S2 ) B S3 ;
}
}
return Trees;

Besides the problems mentioned above, this algorithm also has the problem
that the sharing of subtrees is a non-trivial task. In fact, we assume that
ApplyTransformations produces modified copies of T . To see how ExhaustiveTransformation
works, consider again Figure 2.6. Assume that the top-left join tree is the initial
join tree. Then, from this join tree ApplyTransformations produces all trees
reachable by some edge. All of these are then added to ToDo. The next call
to ApplyTransformations with any to the produced join trees will have the
initial join tree contained in Trees. The complete set of visited join trees after
this step is determined from the initial join tree by following at most two edges.
Let us reformulate the algorithm such that it uses a data structure similar
to dynamic programming or memoization in order to avoid duplicate work. For
any subset of relations, dynamic programming remembers the best join tree.
This does not quite suffice for the transformation-based approach. Instead, we
have to keep all join trees generated so far including those differing in the order
of the arguments or a join operator. However, subtrees can be shared. This
is done by keeping pointers into the data structure (see below). So, the dif-
ference between dynamic programming and the transformation-based approach
becomes smaller. The main remaining difference is that dynamic programming
3.2. DETERMINISTIC ALGORITHMS 97

only considers these join trees while with the transformation-based approach
we have to keep the considered join trees since other join trees (more beneficial)
might be generatable from them.
The data structure used for remembering trees is often called the MEMO
structure. For every subset of relations to be joined (except the empty set), a
class exists in the MEMO structure. Each class contains all the join trees that
join exactly the relations describing the class. Here is an example for join trees
containing three relations.

{R1 , R2 , R3 } {R1 , R2 } B R3 , R3 B {R1 , R2 },

{R1 , R3 } B R2 , R2 B {R1 , R3 },
{R2 , R3 } B R1 , R1 B {R2 , R3 }
{R2 , R3 } {R2 } B {R3 }, {R3 } B {R2 }
{R1 , R3 } {R1 } B {R3 }, {R3 } B {R1 }
{R1 , R2 } {R1 } B {R2 }, {R2 } B {R1 }
{R3 } R3
{R2 } R2
{R1 } R1

Here, we used the set notation {. . .} as an argument to a join to denote a

reference to the class of join trees joining the relations contained in it.
We reformulate our transformation-based algorithm such that it fills in and
uses the MEMO structure [670]. In a first step, the MEMO structure is ini-
tialized by creating an arbitrary join tree for the class {R1 , . . . , Rn } and then
going down this join tree and creating an entry for every join encountered.
Then, we call ExploreClass on the root class comprising all relations to be
joined. ExploreClass then applies ApplyTransformations2 to every member
of the class it is called upon. ApplyTransformations2 then applies all rules to
generate alternatives.

ExhaustiveTransformation2(Query Graph G)
Input: a query specification for relations {R1 , . . . , Rn }.
Output: an optimal join tree
initialize MEMO structure
ExploreClass({R1 , . . . , Rn })
return best of class {R1 , . . . , Rn }

ExploreClass(C)
Input: a class C ⊆ {R1 , . . . , Rn }
Output: none, but has side-effect on MEMO-structure
while (not all join trees in C have been explored) {
choose an unexplored join tree T in C
ApplyTransformation2(T )
mark T as explored
98 CHAPTER 3. JOIN ORDERING

}
return

ApplyTransformations2(T )
Input: a join tree of a class C
Output: none, but has side-effect on MEMO-structure
ExploreClass(left-child(T ));
ExploreClass(right-child(T ));
foreach transformation T and class member of child classes {
foreach T 0 resulting from applying T to T {
if T 0 not in MEMO structure {
add T 0 to class C of MEMO structure
}
}
}
return

ApplyTransformations2 uses a set of transformations to be applied. We dis-

cuss now the effect of different transformation sets on the complexity of the
algorithm. Applying ExhaustiveTransformation2 with a rule set consisting
of Commutativity and Left and Right Associativity generates 4n −3n+1 +2n+2 −
n − 2 duplicates for n relations. Contrast this with the number of join trees
contained in a completely filled MEMO structure3 : 3n − 2n+1 + n + 1. This
clearly shows the problem.
The problem of generating the same join tree several times was considered
by Pellenkoft, Galindo-Legaria, and Kersten [670, 671, 672]. The solution lies in
parameterizing ExhaustiveTransformation2 by an appropriate set of transfor-
mations. The basic idea is to remember for every join operator which rules are
applicable to it. For example, after applying commutativity to a join operator,
we disable commutativity for it.
For acyclic queries, the following rule set guarantees that all bushy join
trees are generated, but no duplicates [672]. Thereby, cross products are not
considered. That is, a rule is only applicable if it does not result in a cross
product. This restricts the applicability of the above algorithm to connected
queries. We use Ci to denote some class of the MEMO structure. We call the
RS-1 following rule set RS-1:

T1 : Commutativity C1 B0 C2 ; C2 B1 C1
Disable all transformations T1 , T2 , and T3 for B1 .

T2 : Right Associativity (C1 B0 C2 ) B1 C3 ; C1 B2 (C2 B3 C3 )

Disable transformations T2 and T3 for B2 and enable all rules for B3 .

T3 : Left associativity C1 B0 (C2 B1 C3 ) ; (C1 B2 C2 ) B3 C3

Disable transformations T2 and T3 for B3 and enable all rules for B2 .
3
The difference to the according number for dynamic programming is due to the fact that
we have to keep alternatives generated by commutativity and that join trees for single relations
are counted.
3.2. DETERMINISTIC ALGORITHMS 99

Class Initialization Transformation Step

{R1 , R2 , R3 , R4 } {R1 , R2 } B111 {R3 , R4 } {R3 , R4 } B000 {R1 , R2 } 3
R1 B100 {R2 , R3 , R4 } 4
{R1 , R2 , R3 } B100 R4 5
{R2 , R3 , R4 } B000 R1 8
R4 B000 {R1 , R2 , R3 } 10

{R2 , R3 , R4 } R2 B111 {R3 , R4 } 4

{R3 , R4 } B000 R2 6
{R2 , R3 } B100 R4 6
R4 B000 {R2 , R3 } 7
{R1 , R3 , R4 }
{R1 , R2 , R4 }
{R1 , R2 , R3 } {R1 , R2 } B111 R3 5
R3 B000 {R1 , R2 } 9
R1 B100 {R2 , R3 } 9
{R2 , R3 } B000 R1 9
{R3 , R4 } R3 B111 R4 R4 B000 R3 2
{R2 , R4 }
{R2 , R3 }
{R1 , R4 }
{R1 , R3 }
{R1 , R2 } R1 B111 R2 R2 B000 R1 1

Figure 3.14: Example of rule transformations (RS-1)

In order to be able to follow these rules, the procedure ApplyTransformations2

has to be enhanced such that it is able to keep track of the application history
of the rules for every join operator. The additional memory requirement is
neglectible, since a single bit for each rules suffices.
As an example, let us consider the chain query R1 −R2 −R3 −R4 . Figure 3.14
shows the MEMO structure. The first column gives the sets of the relations
identifying each class. We leave out the single relation classes assuming that
{Ri } has Ri as its only join tree which is marked as explored.
The second column shows the initialization with an arbitrarily chosen join
tree. The third column is the one filled by the Apply Transformation2 proce-
dure. We apply the rule set RS-1, which consists of three transformations. Each
join is annotated with three bits, where the i-th bit indicates whether Ti is appli-
cable (1) or not (0). After initializing the MEMO structure, ExhaustiveTransformation2
calls ExploreClass for {R1 , R2 , R3 , R4 }. The only (unexplored) join tree is
{R1 , R2 }B111 {R3 , R4 }, which will become the argument of ApplyTransformations2.
Next, ExploreClass is called on {R1 , R2 } and {R3 , R4 }. In both cases, T1 is
the only applicable rule, and the result is shown in the third column under steps
1 and 2. Now we have to apply all transformations on {R1 , R2 } B111 {R3 , R4 }.
100 CHAPTER 3. JOIN ORDERING

Commutativity T1 gives us {R3 , R4 } B000 {R1 , R2 } (Step 3). For right associa-
tivity, we have two elements in class {R1 , R2 }. Substituting them and applying
T2 gives

1. (R1 B R2 ) B {R3 , R4 } ; R1 B100 (R2 B111 {R3 , R4 })

2. (R2 B R1 ) B {R3 , R4 } ; R2 B111 (R1 A {R3 , R4 })

The latter contains a cross product. This leaves us with the former as the result
of Step 4. The right argument of the top most join is R2 B111 {R3 , R4 }. Since
we do not find it in class {R2 , R3 , R4 }, we add it (4).
T3 is next.

1. {R1 , R2 } B (R3 B R4 ) ; ({R1 , R2 } B111 R3 ) B100 R4

2. {R1 , R2 } B (R4 B R3 ) ; ({R1 , R2 } A R4 ) B100 R3

The latter contains a cross product. This leaves us with the former as the result
of Step 5. We also add {R1 , R2 } B111 R3 . Now that {R1 , R2 } B111 {R3 , R4 } is
completely explored, we turn to {R3 , R4 }B000 {R1 , R2 }, but all transformations
are disabled here.
R1 B100 {R2 , R3 , R4 } is next. First, {R2 , R3 , R4 } has to be explored. The
only entry is R2 B111 {R3 , R4 }. Remember that {R3 , R4 } is already explored.
T2 is not applicable. The other two transformations give us

T1 {R3 , R4 } B000 R2

T3 (R2 B000 R3 ) B100 R4 and (R2 A R4 ) B100 R3

Those join trees not exhibiting a cross product are added to the MEMO struc-
ture under 6. Applying commutativity to {R2 , R4 } B100 R3 gives 7. Commuta-
tivity is the only rule enabled for R1 B100 {R2 , R3 , R4 }. Its application results
in 8.
{R1 , R2 , R3 } B100 R4 is next. It is simple to explore the class {R1 , R2 , R3 }
with its only entry {R1 , R2 } B111 R3 :

T1 R3 B000 {R1 , R2 }

T2 R1 B100 (R2 B111 R3 ) and R2 B100 (R1 A R3 )

Commutativity can still be applied to R1 B100 (R2 B111 R3 ). All the new entries
are numbered 9. Commutativity is the only rule enabled for {R1 , R2 , R3 }B100 R4
Its application results in 10.
2
The next two sets of transformations were originally intended for generating
all bushy/left-deep trees for a clique query [671]. They can, however, also be
used to generate all bushy trees when cross products are considered. The rule
set RS-2 for bushy trees is

T1 : Commutativity C1 B0 C2 ; C2 B1 C1
Disable all transformations T1 , T2 , T3 , and T4 for B1 .
3.3. PROBABILISTIC ALGORITHMS 101

T2 : Right Associativity (C1 B0 C2 ) B1 C3 ; C1 B2 (C2 B3 C3 )

Disable transformations T2 , T3 , and T4 for B2 .

T3 : Left Associativity C1 B0 (C2 B1 C3 ) ; (C1 B2 C2 ) B3 C3

Disable transformations T2 , T3 and T4 for B3 .

T4 : Exchange (C1 B0 C2 ) B1 (C3 B2 C4 ) ; (C1 B3 C3 ) B4 (C2 B5 C4 )

Disable all transformations T1 , T2 , T3 , and T4 for B4 .

If we initialize the MEMO structure with left-deep trees, we can strip down
the above rule set to Commutativity and Left Associativity. The reason is an
observation made by Shapiro et al.: from a left-deep join tree we can generate
all bushy trees with only these two rules [787].
If we want to consider only left-deep trees, the following rule set RS-3 is
appropriate:

T1 Commutativity R1 B0 R2 ; R2 B1 R1
Here, the Ri are restricted to classes with exactly one relation. T1 is
disabled for B1 .

T2 Right Join Exchange (C1 B0 C2 ) B1 C3 ; (C1 B2 C3 ) B3 C2

Disable T2 for B3 .

3.3 Probabilistic Algorithms

3.3.1 Generating Random Left-Deep Join Trees with Cross Prod-
ucts
The basic idea of the algorithms in this section and the following sections is to
generate a set of randomly chosen join trees, evaluate their costs, and return
the best one. The problem with this approach lies in the random generation of
join trees: every join tree has to be generated with equal probability. Although
there are some advocates of the pure random approach [299, 300, 302, 298],
typically a random join tree or a set of random join trees is used in subsequent
algorithms like iterative improvement and simulated annealing.
Obviously, if we do not consider cross products the problem is really hard,
since the query graph plays an important role. So let us start with the simplest
case where random join trees are generated that might contain cross products
even for connected query graphs. Then, any join tree is a valid join tree.
The general idea behind all algorithms is the following. Assume that the
number of join trees in the considered search space is known to be N . Then,
instead of generating a random join tree directly, a bijective mapping from
the interval of non-negative integers [0, N [ to a join tree in the search space is
established. Then, a random join tree can be generated by (1) generating a
random number in [0, N [ and (2) mapping the number to the join tree. The
problem of bijectively mapping an interval of non-negative integers to elements
of a set is usually called unranking. The opposite mapping is called ranking.
Obviously, the crux in our case is the efficiency of the unranking problem.
102 CHAPTER 3. JOIN ORDERING

We start with generating random left-deep join trees for n relations. This
problem is identical to generating random permutations. That is, we look for
a fast unranking algorithm that maps the non-negative integers in [0, n![ to
permutations. Let us consider permutations of the numbers {0, . . . , n − 1}.
A mapping between these numbers and relations is established easily, e.g. via
an array. The traditional approach to ranking/unranking of permutations is
to first define an ordering on the permutations and then find a ranking and
unranking algorithm relative to that ordering. For the lexicographic order, al-
gorithms require O(n2 ) time [547, 712]. More sophisticated algorithms separate
the ranking/unranking algorithms into two phases. For ranking, first the in-
version vector of the permutation is established. Then, ranking takes place for
the inversion vector. Unranking works in the opposite direction. The inver-
sion vector of a permutation π = π0 , . . . , πn−1 is defined to be the sequence
v = v0 , . . . , vn−1 , where vi is equal to the number of entries πj with πj > πi
and j < i. Inversion vectors uniquely determine a permutation [863]. However,
naive algorithms of this approach again require O(n2 ) time. Better algorithms
require O(n log n). Using an elaborated data structure, Dietz’ algorithm re-
quires O((n log n)/(log log n)) [238]. Other orders like the Steinhaus-Johnson-
Trotter order have been exploited for ranking/unranking but do not yield any
run-time advantage over the above mentioned algorithms (see [511, 712]).
Since it is not important for our problem that any order constraints are sat-
isfied for the ranking/unranking functions, we use the fastest possible algorithm
established by Myrvold and Ruskey [625]. It runs in O(n) which is also easily
seen to be a lower bound.
The algorithm is based on the standard algorithm to generate random per-
mutations [220, 247, 619]. An array π is initialized such that π[i] = i for
0 ≤ i ≤ n − 1. Then, the loop

for (k = n − 1; k >= 0; − − k) swap(π[k], π[random(k)]);

is executed where swap exchanges two elements and random(k) generates a

random number in [0, k]. This algorithm randomly picks any of the possible
permutations. Assume the random elements produced by the algorithm are
rn−1 , . . . , r0 where 0 ≤ ri ≤ i. Obviously, there are exactly n(n − 1)(n −
2) . . . 1 = n! such sequences and there is a one-to-one correspondence between
these sequences and the set of all permutations. We can thus unrank r ∈ [0, n![
by turning it into a unique sequence of values rn−1 , . . . , r0 . Note that after
executing the swap with rn−1 , every value in [0, n[ is possible at position π[n−1].
Further, π[n−1] is never touched again. Hence, we can unrank r as follows. We
first set rn−1 = r mod n and perform the swap. Then, we define r0 = br/nc
and iteratively unrank r0 to construct a permutation of n − 1 elements. The
following algorithm realizes this idea.

Unrank(n, r) {
Input: the number n of elements to be permuted
and the rank r of the permutation to be constructed
3.3. PROBABILISTIC ALGORITHMS 103

Output: a permutation π
for (i = 0; i < n; + + i) π[i] = i;
Unrank-Sub(n, r, π);
return π;
}
}

Unrank-Sub(n, r, π) {
for (i = n; i > 0; − − i) {
swap(π[i − 1], π[r mod i]);
r = br/ic;
}
}

3.3.2 Generating Random Join Trees with Cross Products

Next, we want to randomly construct bushy plans possibly containing cross
products. This is done in several steps:
1. Generate a random number b in [0, C(n − 1)[.
2. Unrank b to obtain a bushy tree with n − 1 inner nodes.
3. Generate a random number p in [0, n![.
4. Unrank p to obtain a permutation.
5. Attach the relations in order p from left to right as leaf nodes to the binary
tree obtained in Step 2.
The only step that we still have to discuss is Step 2. It is a little involved and we
can only try to bring across the general idea. For details, the reader is referred
to the literature [547, 548, 549].
Consider Figure 3.15. It contains all 14 possible trees with four inner nodes.
The trees are ordered according to the rank we will consider. The bottom-most
number below any tree is its rank in [0, 14[. While unranking, we do not generate
the trees directly, but an encoding of the tree instead. This encoding works as
follows. Any binary tree corresponds to a word in a Dyck language with one
pair of parenthesis. The alphabet hence consists of Σ = {0 (0 , 0 )0 }. For join trees
with n inner nodes, we use Dyck words of length 2n whose parenthesization is
correct. That is, for every 0 (0 , we have a subsequent 0 )0 . From a given join tree,
we obtain the Dyck word by a preorder traversal. Whenever we encounter an
inner node, we encode this with a 0 (0 . All but the last leaf nodes are encoded
by a 0 )0 . Appending all these 2n encodings gives us a Dyck word of length 2n.
Figure 3.15 shows directly below each tree its corresponding Dyck word. In the
line below, we simply changed the representation by substituting every 0 (0 by a
0 10 and every 0 )0 by a 0 00 . The encoding that will be generated by the unranking

algorithm is shown in the third line below each tree: we remember the places
(index in the bit-string) where we find a 0 10 .
104 CHAPTER 3. JOIN ORDERING

B B B B B

B B B B B B

B B B

(((()))) ( ( ( )( ) ) ) ((())()) ((()))() (()(()))

11110000 11101000 11100100 11100010 11011000
1, 2, 3, 4 1, 2, 3, 5 1, 2, 3, 6 1, 2, 3, 7 1, 2, 4, 5
0 1 2 3 4

B B B B B

B B B B B B B B

B B B B B

B B

(()()()) (()())() (())(()) (())()() ()((()))

11010100 11010010 11001100 11001010 10111000
1, 2, 4, 6 1, 2, 4, 7 1, 2, 5, 6 1, 2, 5, 7 1, 3, 4, 5
5 6 7 8 9

B B B B

B B B B B

B B B

()(()()) ()(())() ()()(()) ()()()()

10110100 10110010 10101100 10101010
1, 3, 4, 6 1, 3, 4, 7 1, 3, 5, 6 1, 3, 5, 7
10 11 12 13

Figure 3.15: Encoding Trees

In order to do the unranking, we need to do some counting. Therefor, we

map Dyck words to paths in a triangular grid. For n = 4 this grid is shown in
Figure 3.16. We always start at (0, 0) which means that we have not opened a
parenthesis. When we are at (i, j), opening a parenthesis corresponds to going
to (i + 1, j + 1) and closing a parenthesis to going to (i + 1, j − 1). We have
thus established a bijective mapping between Dyck words and paths in the grid.
Thus counting Dyck words corresponds to counting paths.
3.3. PROBABILISTIC ALGORITHMS 105

4 1
[0,0]

3 4 1
[1,4[

2 9 3 1
[4,9[

1 14 5 2
[9,14[

1 2 3 4 5 6 7 8

Figure 3.16: Paths

The number of different paths from (0, 0) to (i, j) can be computed by

j+1 i+1
p(i, j) =
i + 1 21 (i + j) + 1

These numbers are called the Ballot numbers [129]. The number of paths from
(i, j) to (2n, 0) can thus be computed as (see [548, 549]):

q(i, j) = p(2n − i, j)

Note the special case q(0, 0) = p(2n, 0) = C(n). In Figure 3.16, we annotated
nodes (i, j) by p(i, j). These numbers can be used to assign (sub-) intervals to
paths (Dyck words, trees). For example, if we are at (4, 4), there exists only
a single path to (2n, 0). Hence, the path that travels the edge (4, 4) → (5, 3)
has rank 0. From (3, 3) there are four paths to (2n, 0), one of which we already
considered. This leaves us with three paths that travel the edge (3, 3) → (4, 2).
The paths in this part as assigned ranks in the interval [1, 4[. Figure 3.16 shows
the intervals near the edges. For unranking, we can now proceed as follows.
Assume we have a rank r. We consider opening a parenthesis (go from (i, j) to
(i + 1, j + 1)) as long as the number of paths from that point does no longer
exceed our rank r. If it does, we close a parenthesis instead (go from (i, j) to
(i − 1, j + 1)). Assume, that we went upwards to (i, j) and then had to go down
to (i − 1, j + 1). We subtract the number of paths from (i + 1, j + 1) from our
rank r and proceed iteratively from (i − 1, j + 1) by going up as long as possible
and going down again. Remembering the number of parenthesis opened and
closed along our way results in the required encoding. The following algorithm
finalizes these ideas.

UnrankTree(n, r)
Input: a number of inner nodes n and a rank r ∈ [0, C(n − 1)]
106 CHAPTER 3. JOIN ORDERING

Output: encoding of the inner leaves of a tree

lNoParOpen = 0;
lNoParClose = 0;
i = 1; // current encoding
j = 0; // current position in encoding array
while (j < n) {
k = q(lNoParOpen + lNoParClose + 1, lNoParOpen - lNoParClose + 1);
if (k ≤ r) {
r -= k;
++lNoParClose;
} else {
aTreeEncoding[j++] = i;
++lNoParOpen;
}
++i;
}

Given an array with the encoding of a tree, it is easy to construct the tree
from it. The following procedure does that.

TreeEncoding2Tree(n, aEncoding) {
Input: the number of internal nodes of the tree n
Output: root node of the result tree
root = new Node; /* root of the result tree */
curr = root; /* curr: current internal node whose subtrees are to be created */
i = 1; /* pointer to entry in encoding */
child = 0; /* 0 = left , 1 = right: next child whose subtree is to be created */
while (i < n) {
lDiff = aEncoding[i] - aEncoding[i − 1];
for (k = 1; k < lDif f ; + + k) {
if (child == 0) {
curr->addLeftLeaf();
child = 1;
} else {
curr->addRightLeaf();
while (curr->right() != 0) {
curr = curr->parent();
}
child = 1;
}
}
if (child == 0) {
curr->left(new Node(curr)); // curr becomes parent of new node
curr = curr->left();
++i;
3.3. PROBABILISTIC ALGORITHMS 107

child = 0;
} else {
curr->right(new Node(curr));
curr = curr->right();
++i;
child = 0;
}
}
while (curr != 0) {
curr->addLeftLeaf(); // addLeftLeaf adds leaf if no left-child exists
curr->addRightLeaf(); // analogous
curr = curr->parent();
}
return root;
}

3.3.3 Generating Random Join Trees without Cross Products

A general solution for randomly generating join trees without cross products is
not known. However, if we restrict ourselves to acyclic queries, we can apply
an algorithm developed by Galindo-Legaria, Pellenkoft, and Kersten [300, 299,
302]. For this algorithm to work, we have to assume that the query graph is
connected and acyclic.
For the rest of this section, we assume that G = (V, E) is the query graph
and |V | = n. That is, n relations are to be joined. No join tree contains a cross
product. With every node in a join tree, we associate a level . The root has
level 0. Its children have level 1, and so on. We further use lower-case letters
for relations.
For a given query graph G, we denote by TG the set of join trees for G.
v(k)
Let TG ⊆ TG be the subset of join trees where the leaf node (i.e. relation) v
occurs at level k. Some trivial observations follow. If the query graph consists
v(0)
of a single node (n = 1), then |TG | = |TG | = 1. If n > 1, the top node in
v(0)
the join tree is a join and not a relation. Hence, |TG | = 0. Obviously, the
v(k)
maximum level that can occur in any join tree is n − 1. Hence, |TG | = 0
if k ≥ n. Since the level at which a leaf node v occurs in some join tree is
v(k) v(i) v(j)
unique, we have TG = ∪nk=0 TG and TG ∩ TG = ∅ for i 6= j. This gives us
Pn v(k)
|TG | = k=0 |TG |.
The algorithm generates an unordered tree with n leaf nodes. If we wish
to have a random ordered tree, we have to pick one of the 2n−1 possibilities to
order the (n − 1) joins within the tree. We proceed as follows. We start with
some notation for lists, discuss how two lists can be merged, describe how a
specific merge can be specified, and count the number of possible merges. This
is important, since join trees will be described as lists of trees. Given a leaf
node v, we simply traverse the path from the root to v. Thereby, subtrees that
branch off can be collected into a list of trees. After these remarks, we start
developing the algorithm in several steps. First, we consider two operations
108 CHAPTER 3. JOIN ORDERING

R1 S1 R1 R1 S1

R2 S2 S1 R2 R1

R2 S1 R2
v v
S2 S2 S2
R S
v v v

(R, S, [1, 1, 0]) (R, S, [2, 0, 0]) (R, S, [0, 2, 0])

Figure 3.17: Tree-merge

with which we can construct new join trees: leaf-insertion introduces a new
leaf node into a given tree and tree-merging merges two join trees. Since we
do not want to generate cross products in this section, we have to apply these
operations carefully. Therefor, we need a description of how to generate all
valid join trees for a given query graph. The central data structure for this
purpose is the standard decomposition graph (SDG). Hence, in the second step,
we define SDGs and introduce an algorithm that derives an SDG from a given
query graph. In the third step, we start counting. The fourth and final step
consists of the unranking algorithm. We do not discuss the ranking algorithm.
It can be found in [302].
We use the Prolog notation | to separate the first element of a list from its
tail. For example, the list ha|ti has a as its first element and a tail t. Assume
that P is a property of elements. A list l0 is the projection of a list L on P , if
L0 contains all elements of L satisfying the property P . Thereby, the order is
retained. A list L is a merge of two disjoint lists L1 and L2 if L contains all
elements from L1 and L2 and both are projections of L.
A merge of a list L1 with a list L2 whose respective lengths are l1 and l2
can be described by an array α = [α0 , . . . , αl2 ] of non-negative integers whose
sum is equal to l1 . The non-negative integer αi−1 gives the number of elements
of L1 which precede the i-th element of L2 in the merged list. We obtain the
merged list L by first taking α0 elements from L1 . Then, an element from L2
follows. Then α1 elements from L1 and the next element of L2 follow and so
on. Finally follow the last αl2 elements of L1 . Figure 3.17 illustrates possible
merges.
Compare list merges to the problem of non-negative (weak) integer com-
position [?]. There, we ask for the number of compositions
P of a non-negative
integer n into k non-negative integers αi with ki=1 αk = n. The answer is
n+k−1

k−1 [818]. Since we have to decompose l1 into l2 + 1 non-negative inte-

gers, the number of possible merges is M (l1 , l2 ) = l1 l+l 2
2
. The observation
3.3. PROBABILISTIC ALGORITHMS 109

M (l1 , l2 ) = M (l1 − 1, l2 ) + M (l1 , l2 − 1) allows us to construct an array of size

n ∗ n in O(n2 ) that materializes the values for M . This array will allow us to
rank list merges in O(l1 + l2 ).
The idea for establishing a bijection between [1, M (l1 , l2 )] and the possi-
ble αs is a general one and used for all subsequent algorithms of this section.
Assume that we want to rank the elements of some set S and S = ∪ni=0 Si is
partitioned into disjoint Si . If we want to rank x ∈ Sk , we first find the local
rank of x ∈ Sk . The rank of x is then defined as
k−1
X
|Si | + local-rank(x, Sk )
i=0

To unrank some number r ∈ [1, N ], we first find k such that

j
X
k = min(r ≤ |Si |)
j
i=0

Then, we proceed by unranking with the new local rank

k−1
X
r0 = r − |Si |
i=0

within Sk .
Accordingly, we partition the set of all possible merges into subsets. Each
subset is determined by α0 . For example, the set of possible merges of two
lists L1 and L2 with length l1 = l2 = 4 is partitioned into subsets with α0 = j
for 0 ≤ j ≤ 4. In each partition, we have M (j, l2 − 1) elements. To unrank
a number P r ∈ [1, M (l1 , l2 )], we first determine the partition by computing k =
j 0
minj r ≤ i=0 M (j, l2 − 1). Then, α0 = l1 − k. With the new rank r =
Pk
r − i=0 M (j, l2 − 1), we start iterating all over. The following table gives
the numbers for our example and can be used to understand the unranking
algorithm. The algorithm itself can be found in Figure 3.18.

k α0 (k, l2 − 1) M (k, l2 − 1) rank intervals

0 4 (0, 3) 1 [1, 1]
1 3 (1, 3) 4 [2, 5]
2 2 (2, 3) 10 [6, 15]
3 1 (3, 3) 20 [16, 35]
4 0 (4, 3) 35 [36, 70]

We now turn to the anchored list representation of join trees.

Definition 3.3.1 Let T be a join tree and v be a leaf of T . The anchored list
representation L of T is constructed as follows:
• If T consists of the single leaf node v, then L = hi.

• If T = (T1 B T2 ) and without loss of generality v occurs in T2 , then

L = hT1 |L2 i, where L2 is the anchored list representation of T2 .
110 CHAPTER 3. JOIN ORDERING

UnrankDecomposition(r, l1 , l2 )
Input: a rank r, two list sizes l1 and l2
Output: a merge specification α.
for (i = 0; i ≤ l2 ; + + i) {
alpha[i] = 0;
}
i = k = 0;
while (l1 > 0 && l2 > 0) {
m = M (k, l2 − 1);
if (r ≤ m) {
alpha[i + +] = l1 − k;
l1 = k;
k = 0;
− − l2 ;
} else {
r− = m;
+ + k;
}
}
alpha[i] = l1 ;
return alpha;

Figure 3.18: Algorithm UnrankDecomposition

T1 v T1 T1

T1 v T2
T2 T2
v
w T2 w w w

T (T, 1) (T, 2) (T, 3)

Figure 3.19: Leaf-insertion

We then write T = (L, v).

v(k)
Observe that if T = (L, v) ∈ TG , then T ∈ TG ≺ |L| = k.
The operation leaf-insertion is illustrated in Figure 3.19. A new leaf v is
inserted into the tree at level k. Formally, it is defined as follows.
3.3. PROBABILISTIC ALGORITHMS 111

e +e [0, 5, 5, 5, 3]
e c ∗c [0, 0, 2, 3]
a b c d
b d
[0, 1, 1] +c +c [0, 1]
a
[0, 1] +b d [1]

[1] a

Figure 3.20: A query graph, its tree, and its standard decomposition graph

Definition 3.3.2 Let G = (V, E) be a query graph, T a join tree of G. v ∈ V

be such that G0 = G|V \{v} is connected, (v, w) ∈ E, 1 ≤ k < n, and

T = (hT1 , . . . , Tk−1 , v, Tk+1 , . . . , Tn i, w) (3.10)

0
T = (hT1 , . . . , Tk−1 , Tk+1 , . . . , Tn i, w). (3.11)

Then we call (T 0 , k) an insertion pair on v and say that T is decomposed into

(or constructed from) the pair (T 0 , k) on v.
v(k)
Observe that leaf-insertion defines a bijective mapping between TG and inser-
w(i)
tion pairs (T 0 , k) on v, where T 0 is an element of the disjoint union ∪n−2
i=k−1 TG0 .
The operation tree-merging is illustrated in Figure 3.17. Two trees R =
(LR , w) and S = (LS , w) on a common leaf w are merged by merging their
anchored list representations.

Definition 3.3.3 Let G = (V, E) be a query graph, w ∈ V , T = (L, w) a

join tree of G, V1 , V2 ⊆ V such that G1 = G|V1 and G2 = G|V2 are connected,
V1 ∪ V2 = V , and V1 ∩ V2 = {w}. For i = 1, 2:

• Define the property Pi to be “every leaf of the subtree is in Vi ”,

• Let Li be the projection of L on Pi .

• Ti = (Li , w).

Let α be the integer composition such that L is the result of merging L1 and L2
on α. Then we call (T1 , T2 , α) a merge triplet. We say that T is decomposed
into (constructed from) (T1 , T2 , α) on V1 and V2 .

Observe that the tree-merging operation defines a bijective mapping between

w(k) w(i) w(k−i)
TG and merge triplets (T1 , T2 , α), where T1 ∈ TG1 , T2 ∈ TG2 , and α
specifies a merge of two lists of sizes i and k − i. Further,
theknumber of these
merges (i.e. the number of possibilities for α) is i+(k−i)
k−i = i .
A standard decomposition graph of a query graph describes the possible
constructions of join trees. It is not unique (for n > 1) but anyone can be used
112 CHAPTER 3. JOIN ORDERING

to construct all possible unordered join trees. For each of our two operations it
has one kind of inner nodes. A unary node labeled +v stands for leaf-insertion
of v. A binary node labeled ∗w stands for tree-merging its subtrees whose only
common leaf is w.
The standard decomposition graph of a query graph G = (V, E) is con-
structed in three steps:
1. pick an arbitrary node r ∈ V as its root node;

2. transform G into a tree G0 by directing all edges away from r;

3. call QG2SDG(G0 , r)
with

QG2SDG(G0 , r)
Input: a query tree G0 = (V, E) and its root r
Output: a standard query decomposition tree of G0
Let {w1 , . . . , wn } be the children of v;
switch (n) {
case 0: label v with "v";
case 1:
label v as "+v ";
QG2SDG(G0 , w1 );
otherwise:
label v as "∗v ";
create new nodes l, r with label +v ;
E \ = {(v, wi )|1 ≤ i ≤ n};
E ∪ = {(v, l), (v, r), (l, w1 )} ∪ {(r, wi )|2 ≤ i ≤ n};
QG2SDG(G0 , l);
QG2SDG(G0 , r);
}
return G0 ;

Note that QG2SDG transforms the original graph G0 into its SDG by side-effects.
Thereby, the n-ary tree is transformed into a binary tree similar to the procedure
described by Knuth [496, Chap 2.3.2]. Figure 3.20 shows a query graph G, its
tree G0 rooted at e, and its standard decomposition tree.
v(k)
For an efficient access to the number of join trees in some partition TG
in the unranking algorithm, we materialize these numbers. This is done in the
count array. The semantics of a count array [c0 , c1 , . . . , cn ] of a node u with
label ◦v (◦ ∈ {+, ∗}) of the SDG is that u can construct ci different trees in
which leaf v is at level i. Then, the total number of trees for a query can be
computed by summing up all the ci in the count array of the root node of the
decomposition tree.
To compute the count and an additional summand adornment of a node
labeled +v , we use the following lemma.
3.3. PROBABILISTIC ALGORITHMS 113

Lemma 3.3.4 Let G = (V, E) be a query graph with n nodes, v ∈ V such that
G0 = G|V \v is connected, (v, w) ∈ E, and 1 ≤ k < n. Then
v(k)
X w(i)
|TG | = |TG0 |
i≥k−1

This lemma follows from the observation made after the definition of the leaf-
insertion operation.
w(i)
The sets TG0 used in the summands of Lemma 3.3.4 directly correspond
v(k),i v(k),i
to subsets TG (k − 1 ≤ i ≤ n − 2) defined such that T ∈ TG if
v(k)
1. T ∈ TG ,
2. the insertion pair on v of T is (T 0 , k), and
w(i)
3. T 0 ∈ TG0 .
v(k),i w(i)
Further, |TG | = |TG0 |. For efficiency, we materialize the summands in an
array of arrays summands.
To compute the count and summand adornment of a node labeled ∗v , we use
the following lemma.
Lemma 3.3.5 Let G = (V, E) be a query graph, w ∈ V , T = (L, w) a join
tree of G, V1 , V2 ⊆ V such that G1 = G|V1 and G2 = G|V2 are connected,
V1 ∪ V2 = V , and V1 ∩ V2 = {v}. Then
v(k)
X k v(i) v(k−i)
|TG | = |TG1 | |TG2 |
i
i

This lemma follows from the observation made after the definition of the tree-
merge operation.
w(i)
The sets TG0 used in the summands of Lemma 3.3.5 directly correspond
v(k),i v(k),i
to subsets TG (0 ≤ i ≤ k) defined such that T ∈ TG if
v(k)
1. T ∈ TG ,
2. the merge triplet on V1 and V2 of T is (T1 , T2 , α), and
v(i)
3. T1 ∈ TG1 .
v(k),i v(i) v(k−i)
Further, |TG | = ki |TG1 | |TG2 |.
Before we come to the algorithm for computing the adornments count and
summands, let us make one observation that follows directly from the above
two lemmata. Assume a node v whose count array is [c1 , . . . , cP m ] and whose
summands is s = [s0 , . . . , sn ] with si = [si0 , . . . , sim ], then ci = m i
j=0 sj holds.
Figure 3.21 contains the algorithm to adorn SDG’s nodes with count and
summands. It has worst-case complexity O(n3 ). Figure 3.20 shows the count
adornment for the SDG. Looking at the count array of the root node, we see
that the total number of join trees for our example query graph is 18.
The algorithm UnrankLocalTreeNoCross called by UnrankTreeNoCross adorns
the standard decomposition graph with insert-at and merge-using annota-
tions. These can then be used to extract the join tree.
114 CHAPTER 3. JOIN ORDERING

Adorn(v)
Input: a node v of the SDG
Output: v and nodes below are adorned by count and summands
Let {w1 , . . . , wn } be the children of v;
switch (n) {
case 0: count(v) := [1]; // no summands for v
case 1:
Adorn(w1 );
assume count(w1 ) = [c10 , . . . , c1m1 ]; P 1
count(v) = [0, c1 , . . . , cm1 +1 ] where ck = m i=k−1 ci ;
1

summands(v) 0
= [s , . . . , s m 1 +1 ] where s = [s0 , . . . , skm1 +1 ] and
k k
1
ci if 0 < k and k − 1 ≤ i
ski =
0 else
case 2:
Adorn(w1 );
Adorn(w2 );
assume count(w1 ) = [c10 , . . . , c1m1 ];
assume count(w2 ) = [c20 , . . . , c2m2 ];
count(v) = [c0 , . . . , cm1 +m2 ] where
P 1 k 1 2
ck = m 2
i=0 i ci ck−i ; // ci = 0 for i 6∈ {0, . . . , m2 }
0 m +m ] where sk = [sk0 , . . . , skm1 ] and
k 1 2= [s , . . . , s
summands(v) 1 2

ski = i ci ck−i if 0 ≤ k − i ≤ m2
0 else
}

Figure 3.21: Algorithm Adorn

UnrankTreeNoCross(r,v)
Input: a rank r and the root v of the SDG
Output: adorned SDG
let count(v) = [x0 , . . . , xm ];
P
k := minj r ≤ ji=0 xi ; // efficiency: binary search on materialized sums.
P
r0 := r − k−1
i=0 xi ;
UnrankLocalTreeNoCross(v, r0 , k);

e(k)
The following table shows the intervals associated with the partitions TG for
the standard decomposition graph in Figure 3.20:
Partition Interval
e(1)
TG [1, 5]
e(2)
TG [6, 10]
e(3)
TG [11, 15]
e(4)
TG [16, 18]
3.3. PROBABILISTIC ALGORITHMS 115

The unranking procedure makes use of unranking decompositions and un-

ranking triples. For the latter and a given X, Y, Z, we need to assign each
member in

{(x, y, z)|1 ≤ x ≤ X, 1 ≤ y ≤ Y, 1 ≤ z ≤ Z}

a unique number in [1, XY Z] and base an unranking algorithm on this assign-

ment. We leave this as a simple exercise to the reader and call the function
UnrankTriplet(r, X, Y, Z). Here, r is the rank and X, Y , and Z are the upper
bounds for the numbers in the triplets. The code for unranking looks as follows:

UnrankingTreeNoCrossLocal(v, r, k)
Input: an SDG node v, a rank r, a number k identifying a partition
Output: adornments of the SDG as a side-effect
Let {w1 , . . . , wn } be the children of v
switch (n) {
case 0:
assert(r = 1 && k = 0);
// no additional adornment for v
case 1:
let count(v) = [c0 , . . . , cn ];
let summands(v) = [s0 , . . . , sn ];
assert(k ≤ n && r ≤ ck );
P
k1 = minj r ≤ ji=0 ski ;
P 1 −1 k
r1 = r − ki=0 si ;
insert-at(v) = k;
UnrankingTreeNoCrossLocal(w1 , r1 , k1 );
case 2:
let count(v) = [c0 , . . . , cn ];
let summands(v) = [s0 , . . . , sn ];
let count(w1 ) = [c10 , . . . , c1n1 ];
let count(w2 ) = [c20 , . . . , c2n2 ];
assert(k ≤ n && r ≤ ck );
P
k1 = minj r ≤ ji=0 ski ;
P 1 −1 k
q = r − ki=0 si ;
k2 = k − k1 ;

(r1 , r2 , a) = UnrankTriplet(q, c1k1 , c2k2 , ki );
α = UnrankDecomposition(a);
merge-using(v) = α;
UnrankingTreeNoCrossLocal(w1 , r1 , k1 );
UnrankingTreeNoCrossLocal(w2 , r2 , k2 );
}
116 CHAPTER 3. JOIN ORDERING

3.3.4 Quick Pick

The QuickPick algorithm of Waas and Pellenkoft [891, 892] does not generate
random join trees in the strong sense but comes close to it and is far easier to
implement and more broadly applicable. The idea is to randomly select an edge
in the query graph and to construct a join tree corresponding to this edge.

QuickPick(Query Graph G)
Input: a query graph G = ({R1 , . . . , Rn }, E)
Output: a bushy join tree
BestTreeFound = any join tree
while stopping criterion not fulfilled {
E 0 = E;
Trees = {R1 , . . . , Rn };
while (|Trees| > 1) {
choose e ∈ E 0 ;
E 0 − = e;
if (e connects two relations in different subtrees T1 , T2 ∈ Trees) {
Trees -= T1 ;
Trees -= T2 ;
Trees += CreateJoinTree(T1 , T2 );
}
}
Tree = single tree contained in Trees;
if (cost(Tree) < cost(BestTreeFound)) {
BestTreeFound = Tree;
}
}
return BestTreeFound

3.3.5 Iterative Improvement

Swami and Gupta [847], Swami [846] and Ioannidis and Kang [445] applied the
idea of iterative improvement to join ordering [445]. The idea is to start from
a random plan and then to apply randomly selected transformations from a
rule set if they improve the current join tree, until not further improvement is
possible.

IterativeImprovementBase(Query Graph G)
Input: a query graph G = ({R1 , . . . , Rn }, E)
Output: a join tree
do {
JoinTree = random tree
JoinTree = IterativeImprovement(JoinTree)
3.3. PROBABILISTIC ALGORITHMS 117

if (cost(JoinTree) < cost(BestTree)) {

BestTree = JoinTree;
}
} while (time limit not exceeded)
return BestTree

IterativeImprovement(JoinTree)
Input: a join tree
Output: improved join tree
do {
JoinTree’ = randomly apply a transformation to JoinTree;
if (cost(JoinTree’) < cost(JoinTree)) {
JoinTree = JoinTree’;
}
} while (local minimum not reached)
return JoinTree

The number of variants of iterative improvements is large. The first parame-

ter is the used rule set. To restrict search to left-deep trees, a rule set consisting
of swap and 3cycle is appropriate [847]. If we consider bushy trees, a complete
set consisting of commutativity, associativity, left join exchange and right join
exchange makes sense. This rule set (proposed by Ioannidis and Kang) is ap-
propriate to explore the whole space of bushy join trees. A second parameter
is how to determine whether the local minimum has been reached. Considering
all possible neighbor states of a join tree is expensive. Therefor, a subset of size
k is sometimes considered. Then, for example, k can be limited to the number
of edges in the query graph [847].

3.3.6 Simulated Annealing

Iterative Improvement suffers from the drawback that it only applies a move
if it improves the current plan. This leads to the problem that one is often
stuck in a local minimum. Simulated annealing tries to avoid this problem by
allowing moves that result in more expensive plans [450, 445, 847]. However,
instead of considering every plan, only those whose cost increase does not exceed
a certain limit are considered. During time, this limit decreases. This general
idea is cast into the notion temperatures and probabilities of performing a
selected transformation. A generic formulation of simulated annealing could
look as follows:

SimulatedAnnealing(Query Graph G)
Input: a query graph G = ({R1 , . . . , Rn }, E)
Output: a join tree
BestTreeSoFar = random tree;
Tree = BestTreeSoFar;
118 CHAPTER 3. JOIN ORDERING

do {
do {
Tree’ = apply random transformation to Tree;
if (cost(Tree’) < cost(Tree)) {
Tree = Tree’;
} else {
0
with probability e−(cost(T ree )−cost(T ree))/temperature
Tree = Tree’;
}
if (cost(Tree) < cost(BestTreeSoFar)) {
BestTreeSoFar = Tree’;
}
} while (equilibrium not reached)
reduce temperature;
} while (not frozen)
return BestTreeSoFar

Besides the rule set used, the initial temperature, the temperature reduc-
tion, and the definitions of equilibrium and frozen determine the algorithm’s
behavior. For each of them several alternatives have been proposed in the lit-
erature. The starting temperature can be calculated as follows: determine the
standard deviation σ of costs by sampling and multiply it with a constant val-
ue ([847] use 20). An alternative is to set the starting temperature twice the
cost of the first randomly selected join tree [445] or to determine the starting
temperature such that at least 40% of all possible transformations are accepted
[823].
For temperature reduction, we can apply the formula temp∗ = 0.975 [445]
λt
or max(0.5, e− σ ) [847].
The equilibrium is defined to be reached if for example the cost distribution
of the generated solutions is sufficiently stable [847], the number of iterations is
sixteen times the number of relations in the query [445], or number of iterations
is the same as the number of relations in the query [823].
We can establish frozenness if the difference between the maximum and
minimum costs among all accepted join trees at the current temperature equals
the maximum change in cost in any accepted move at the current temperature
[847], the current solution could not be improved in four outer loop iterations
and the temperature has been fallen below one [445], or the current solution
could not be improved in five outer loop iterations and less than two percent of
the generated moves were accepted [823].
Considering databases are used in mission critical applitions. Would you
bet your business on these numbers?

3.3.7 Tabu Search

Morzy, Matyasiak and Salza applied Tabu Search to join ordering [618]. The
general idea is that among all neighbors reachable via the transformations, only
3.3. PROBABILISTIC ALGORITHMS 119

the cheapest is considered even if its cost are higher than the costs of the current
join tree. In order to avoid running into cycles, a tabu set is maintained. It
contains the last join trees generated, and the algorithm is not allowed to visit
them again. This way, it can escape local minima, since eventually all nodes in
the valley of a local minimum will be in the tabu set. The stopping conditions
could be that there ws no improvement over the current best solution found
during the last given number of iterations or if the set neighbors minus the tabu
set is empty (in line (*)).
Tabu Search looks as follows:

TabuSearch(Query Graph)
Input: a query graph G = ({R1 , . . . , Rn }, E)
Output: a join tree
Tree = random join tree;
BestTreeSoFar = Tree;
TabuSet = ∅;
do {
Neighbors = all trees generated by applying a transformation to Tree;
Tree = cheapest in Neighbors \ TabuSet; (*)
if (cost(Tree) < cost(BestTreeSoFar)) {
BestTreeSoFar = Tree;
}
if(|TabuSet| > limit) remove oldest tree from TabuSet;
TabuSet += Tree;
} while (not stopping condition satisfied);
return BestTreeSoFar;

3.3.8 Genetic Algorithms

Genetic algorithms are inspired by evolution: only the fittest survives [327].
They work with a population that evolves from generation to generation. Suc-
cessors are generated by crossover and mutation. Further, a subset of the cur-
rent population (the fittest) are propagated to the next generation (selection).
The first generation is generated by a random generation process.
The problem is how to represent each individual in a population. The
following analogies are used:

• Chromosome ←→ string

• Gene ←→ character

In order to solve an optimization problem with genetic algorithms, an encoding

is needed as well as a specification for selection, crossover, and mutation.
Genetic algorithms for join ordering have been considered in [73, 823]. We
first introduce alternative encodings, then come to the selection process, and
finally discuss crossover and mutation.
120 CHAPTER 3. JOIN ORDERING

Encodings We distinguish ordered list and ordinal number encodings. Both

encodings are used for left-deep and bushy trees. In all cases we assume that
the relations R1 , . . . , Rn are to be joined and use the index i to denote Ri .

1. Ordered List Encoding

(a) left-deep trees

A left-deep join tree is encoded by a permutation of 1, . . . , n. For
instance, (((R1 B R4 ) B R2 ) B R3 ) is encoded as “1423”.
(b) bushy trees
Bennet, Ferris, and Ioannidis proposed the following encoding scheme
[73, 74]. A bushy join-tree without cartesian products is encoded as
an ordered list of the edges in the join graph. Therefore, we num-
ber the edges in the join graph. Then the join tree is encoded in a
bottom-up, left-to-right manner. See Figure 3.22 for an example.

2. Ordinal Number Encoding

(a) left-deep trees

A join tree is encoded by using a list of relations that is short-
ened whenever a join has been encoded. We start with the list
L = hR1 , . . . , Rn i. Then within L we find the index of first rela-
tion to be joined. Let this relation be Ri . Ri is the i-th relation
in L. Hence, the first character in the chromosome string is i. We
eliminate Ri from L. For every subsequent relation joined, we again
determine its index in L, remove it from L and append the index to
the chromosome string. For instance, starting with hR1 , R2 , R3 , R4 i,
the left-deep join tree (((R1 B R4 ) B R2 ) B R3 ) is encoded as “1311”.
(b) bushy trees
Again, we start with the list L = hR1 , . . . , Rn i and encode a bushy
join tree in a bottom-up, left-to-right manner. Let Ri B Rj be the
first join in the join tree under this ordering. Then we look up their
positions in L and add them to the encoding. Next we eliminate Ri
and Rj from L and push Ri,j to the front of it. We then proceed
for the other joins by again selecting the next join which now can
be between relations and/or subtrees. We determine their position
within L, add these positions to the encoding, remove them from L,
and insert a composite relation into L such that the new composite
relation directly follows those already present. For instance, starting
with the list hR1 , R2 , R3 , R4 i, the bushy join tree ((R1 B R2 ) B (R3 B
R4 )) is encoded as “12 23 12”.

The encoding is completed by adding join methods.

Crossover A crossover generates a new solution from two individuals. There-

fore, two partial solutions are combined. Obviously, its definition depends on
the encoding. Two kinds of crossovers are distinguished: the subsequence and
the subset exchange.
3.3. PROBABILISTIC ALGORITHMS 121

1 R2
B
R1 2

B B 1243
R3

3
B R3 R4 R5
4
R5 R4
R1 R2

Figure 3.22: A query graph, a join tree, and its encoding

The subsequence exchange for the ordered list encoding works as follows.
Assume two individuals with chromosomes u1 v1 w1 and u2 v2 w2 . From these we
generate u1 v10 w1 and u2 v20 w2 , where vi0 is a permutation of the relations in vi
such that the order of their appearence is the same as in u3−i v3−i w3−i . In order
to adapt the subsequence exchange operator to the ordinal number encoding,
we have to require that the vi are of equal length (|v1 | = |v2 |) and occur at the
same offset (|u1 | = |u2 |). We then simply swap the vi . That is, we generate
u1 v2 w1 and u2 v1 w2 .
The subset exchange is defined only for the ordered list encoding. Within
the two chromosomes, we find two subsequences of equal length comprising the
same set of relations. These sequences are then simply exchanged.

Mutation A mutation randomly alters a character in the encoding. If du-

plicates must not occur — as in the ordered list encoding — swapping two
characters is a perfect mutation.

Selection The probability of a join tree’s survival is determined by its rank

in the population. That is, we calculate the costs of the join trees encoded
for each member of the population. Then we sort the population according
to their associated costs and assign probabilities to each individual such that
the best solution in the population has the highest probability to survive and
so on. After probabilities have been assigned, we randomly select members of
the population taking these probabilities into account. That is, the higher the
probability of a member, the higher is its chance to survive.

Algorithm The genetic algorithm then works as follows. First, we create a

random population of a given size (say 128). We apply crossover and mutation
with a given rate, for example such that 65% of all members of a population
participate in crossover, and 5% of all members of a population are subject to
random mutation. Then we apply selection until we again have a population
122 CHAPTER 3. JOIN ORDERING

of a given size. We stop after we have not seen an improvement within the
population for a fixed number of iterations (say 30).

3.4 Hybrid Algorithms

All the algorithms we have seen so far can be combined to result in new ap-
proaches to join ordering. Some of the numerous possibilities have been de-
scribed in the literature. We present them.

3.4.1 Two Phase Optimization

Two phase optimization combines Iterative Improvement with Simulated An-
nealing [445]. For a number of randomly generated initial trees, Iterative Im-
provement is used to find a local minimum. Then Simulated Annealing is start-
ed to find a better plan in the neighborhood of the local minima. The initial
temperature of Simulated Annealing can be lower as is its original variants.

3.4.2 AB-Algorithm
The AB-Algorithm was developed by Swami and Iyer [848, 849]. It builds on
the IKKBZ-Algorithm by resolving its limitations. First, if the query graph
is cyclic, a spanning tree is selected. Second, two different cost functions for
joins (join methods) are supported by the AB-Algorithm: nested loop join and
sort merge join. In order to make the sort merge join’s cost model fit the ASI
property, it is simplified. Third, join methods are assigned randomly before
IKKBZ is called. Afterwards, an iterative improvement phase follows. The
algorithm can be formulated as follows:

AB(Query Graph G)
Input: a query graph G = ({R1 , . . . , Rn }, E)
Output: a left-deep join tree
while (number of iterations ≤ n2 ) {
if G is cyclic take spanning tree of G
randomly attach a join method to each relation
JoinTree = result of IKKBZ
while (number of iterations ≤ n2 ) {
apply Iterative Improvement to JoinTree
}
}
return best tree found

3.4.3 Toured Simulated Annealing

Lanzelotte, Valduriez, and Zäit introduced toured simulated annealing as a
search strategy useful in distributed databases where the search space is even
3.5. ORDERING ORDER-PRESERVING JOINS 123

larger than in centralized systems [524]. The basic idea is that simulated an-
nealing is called n times with different initial join trees, if n is the number of
relations to be joined. Each join sequence in the set Solutions produced by
GreedyJoinOrdering-3 is used to start an independent run of simulated an-
nealing. As a result, the starting temperature can be decreased to 0.1 times
the cost of the initial plan.

3.4.4 GOO-II
GOO-II appends an Iterative Improvement step to the GOO-Algorithm.

3.4.5 Iterative Dynamic Programming

Iterative Dynamic Programming combines heuristics with dynamic program-
ming in order to overcome the deficiencies of both. It comes in two variants
[507, 795]. The first variant, IDP-1 (see Figure 3.23), first creates all join trees
which contain up to k relations where k is a parameter of the algorithm. After
this step, it selects the cheapest join tree comprising k relations, replaces it by
a new compound relation and starts all over again. The iteration stops, when
only one compound relation representing a join tree for all relations remains in
the ToDo list.
The second variant, IDP-2 (see Figure 3.24), works the other way round.
It first applies a greedy heuristics to build join trees of size up to k. To the
larger subtree it applies dynamic programming to improve it. The result of
the optimized outcome of the greedy algorithm is then encapsulated in a new
compound relation which replaces its constituent relations in the ToDo list. The
algorithm then iterates until only one entry remains in the ToDo list.
Obviously, from these two basic variants several others can be derived. A
systematic investigation of the basic algorithms and their variants is given by
Kossmann and Stocker [507]. It turns out that the most promising variants
exist for IDP-1.

3.5 Ordering Order-Preserving Joins

This section covers an algorithm for ordering order-preserving joins [604]. This
is important for XQuery and other languages that require order-preservation.
XQuery specifies that the result of a query is a sequence. If no unordered or
order by instruction is given, the order of the output sequence is determined
by the order of the input sequences given in the for clauses of the query. If
there are several entries in a for clause or several for clauses, order-preserving
join operators [179] can be a natural component for the evaluation of such a
query.
The order-preserving join operator is used in several algebras in the context
of

• semi-structured data and XML (e.g. SAL [69], XAL [289]),

• OLAP [809], and

124 CHAPTER 3. JOIN ORDERING

IDP-1({R1 , . . . , Rn }, k)
Input: a set of relations to be joined, maximum block size k
Output: a join tree
for (i = 1; i <= n; ++i) {
BestTree({Ri }) = Ri ;
}
ToDo = {R1 , . . . , Rn };
while (|ToDo| > 1) {
k = min(k, |ToDo|);
for (i = 2; i < k; ++i) {
for all S ⊆ ToDo, |S| = i do {
for all O ⊂ S do {
BestTree(S) = CreateJoinTree(BestTree(S \ O), BestTree(O));
}
}
}
find V ⊂ ToDo, |V | = k
with cost(BestTree(V )) = min{cost(BestTree(W )) | W ⊂ ToDo, |W | = k};
generate new symbol T ;
BestTree({T }) = BestTree(V );
ToDo = (ToDo \ V ) ∪ {T };
for all O ⊂ V do delete(BestTree(O));
}
return BestTree({R1 , . . . , Rn });

Figure 3.23: Pseudo code for IDP-1

• time series data [535].

We give a polynomial algorithm that produces bushy trees for a sequence of

order-preserving joins and selections. These trees may contain cross products
even if the join graph is connected. However, we apply selections as early as
possible. The algorithm then produces the optimal plan among those who push
selections down. The cost function is a parameter of the algorithm, and we do
not need to restrict ourselves to those having the ASI property. Further, we
need no restriction on the join graph, i.e. the algorithm produces the optimal
plan even if the join graph is cyclic.
Before defining the order-preserving join, we need some preliminaries. The
above algebras work on sequences of sets of variable bindings, i.e. sequences of
unordered tuples where every attribute corresponds to a variable. (See Chap-
ter 7.16 for a general discussion.) Single tuples are constructed using the stan-
dard [·] brackets. Concatenation of tuples and functions is denoted by ◦. The
set of attributes defined for an expression e is defined as A(e). The set of
free variables of an expression e is defined as F(e). For sequences e, we use
3.5. ORDERING ORDER-PRESERVING JOINS 125

IDP-2({R1 , . . . , Rn }, k)
Input: a set of relations to be joined, maximum block size k
Output: a join tree
for (i = 1; i <= n; ++i) {
BestTree({Ri }) = Ri ;
}
ToDo = {R1 , . . . , Rn };
while (|ToDo| > 1) {
// apply greedy algorithm to select a good building block
B = ∅;
for all v ∈ ToDo, do {
B += BestTree({v});
}
do {
find L, R ∈ B
with cost(CreateJoinTree(L,R))
= min{cost(CreateJoinTree(L0 ,R0 )) | L0 , R0 ∈ B};
P = CreateJoinTree(L,R));
B = (B \ {L, R}) ∪ {P };
} while (P involves no more than k relations and |B| > 1);
// reoptimize the bigger of L and R,
// selected in the last iteration of the greedy loop
if (L involves more tables than R) {
ReOpRels = relations involved in L;
} else {
ReOpRels = relations involved in R;
}
P = DP-Bushy(ReOpRels);
generate new symbol T ;
BestTree({T }) = P ;
ToDo = (ToDo \ ReOpRels) ∪ {T };
for all O ⊂ V do delete(BestTree(O));
}
return BestTree({R1 , . . . , Rn });

Figure 3.24: Pseudocode for IDP-2

α(e) to denote the first element of a sequence. We identify single element se-
quences with elements. The function τ retrieves the tail of a sequence, and ⊕
concatenates two sequences. We denote the empty sequence by .
We define the algebraic operators recursively on their input sequences. The
order-preserving join operator is defined as the concatenation of an order-
preserving selection and an order-preserving cross product. For unary oper-
ators, if the input sequence is empty, the output sequence is also empty. For
126 CHAPTER 3. JOIN ORDERING

binary operators, the output sequence is empty whenever the left operand rep-
resents an empty sequence.
The order-preserving join operator is based on the definition of an order-
preserving cross product operator defined as
ˆ ) ⊕ (τ (e )×e
ˆ 2 := (α(e1 )Ae
e1 ×e 2 1 ˆ 2)

where (
ˆ := if e2 =
e1 Ae2 ˆ
(e1 ◦ α(e2 )) ⊕ (e1 Aτ (e2 )) else
We are now prepared to define the join operation on ordered sequences:

ˆ 2)
e1 B̂p e2 := σ̂p (e1 ×e

where the order-preserving selection is defined as


 if e =
σ̂p (e) := α(e) ⊕ σ̂p (τ (e)) if p(α(e))

σ̂p (τ (e)) else

As usual, selections can be reordered and pushed inside order-preserving

joins. Besides, the latter are associative. The following equivalences formalize
this.
σ̂p1 (σ̂p2 (e)) = σ̂p2 (σ̂p1 (e))
σ̂p1 (e1 B̂p2 e2 ) = σ̂p1 (e1 )B̂p2 e2 if F(p1 ) ⊆ A(e1 )
σ̂p1 (e1 B̂p2 e2 ) = e1 B̂p2 σ̂p1 (e2 ) if F(p1 ) ⊆ A(e2 )
e1 B̂p1 (e2 B̂p2 e3 ) = (e1 B̂p1 e2 )B̂p2 e3 if F(pi ) ⊆ A(ei ) ∪ A(ei+1 )

While being associative, the order-preserving join is not commutative, as the

following example illustrates. Given two tuple sequences R1 = h[a : 1], [a : 2]i
and R2 = h[b : 1], [b : 2]i, we have

R1 B̂true R2 = h[a : 1, b : 1], [a : 1, b : 2], [a : 2, b : 1], [a : 2, b : 2]i

R2 B̂true R1 = h[a : 1, b : 1], [a : 2, b : 1], [a : 1, b : 2], [a : 2, b : 2]i

Before introducing the algorithm, let us have a look at the size of the search
space. Since the order-preserving join is associative but not commutative, the
input to the algorithm must be a sequence of join operators or, likewise, a
sequence of relations to be joined. The output is then a fully parenthesized
expression. Given a sequence of n binary associative but not commutative
operators, the number of fully parenthesized expressions is (see [204])

1 if n = 1
P (n) = Pn−1
k=1 P (k)P (n − k) if n>1

We have that P (n)

1 2n
= C(n − 1), where 4C(n)
n
are the Catalan numbers defined
as C(n) = n+1 n . Since C(n) = Ω( n3/2 ), the search space is exponential in
size.
3.5. ORDERING ORDER-PRESERVING JOINS 127

applicable-predicates(R, P)

01 B=∅
02 foreach p ∈ P
03 IF (F(p) ⊆ A(R))
04 B+ = p
05 return B

Figure 3.25: Subroutine applicable-predicates

The algorithm is inspired by the dynamic programming algorithm for finding

optimal parenthesized expressions for matrix-chain multiplication [204]. The
differences are that we have to encapsulate the cost function and deal with
selections. We give a detailed example application of the algorithm below.
This example illustrates (1) the optimization potential, (2) that cross products
can be favorable, (3) how to plug in a cost function into the algorithm, and (4)
the algorithm itself.
The algorithm itself is broken up into several subroutines. The first is
applicable-predicates (see Fig. 3.25). Given a sequence of relations Ri , . . . , Rj
and a set of predicates, it retrieves those predicates applicable to the result of
the join of the relations. Since joins and selections can be reordered freely, the
only condition for a predicate to be applicable is that all its free variables are
bound by the given relations.
The second subroutine is the most important and intrigued. It fills several
arrays with values in a bottom-up manner. The third subroutine then builds
the query evaluation plan using the data in the arrays.
The subroutine construct-bushy-tree takes as input a sequence R1 , . . . , Rn
of relations to be joined and a set P of predicates to be applied. For every
possible subsequence Ri , . . . , Rj , the algorithm finds the best plan to join these
relations. Therefor, it determines some k such that the cheapest plan joins
the intermediate results for Ri , . . . , Rk and Rk+1 , . . . , Rj by its topmost join.
For this it is assumed that for all k the best plans for joining Ri , . . . , Rk and
Rk+1 , . . . , Rj are known. Instead of directly storing the best plan, we remember
(1) the costs of the best plan for Ri , . . . , Rj for all 1 ≤ i ≤ j ≤ n and (2) the
k where the split takes place. More specifically, the array c[i, j] contains the
costs of the best plan for joining Ri , . . . , Rj , and the array t[i, j] contains the
k such that this best plan joins Ri , . . . , Rk and Rk+1 , . . . , Rj with its topmost
join. For every sequence Ri , . . . , Rj , we also remember the set of predicates
that can be applied to it, excluding those that have been applied earlier. These
applicable predicates are contained in p[i, j]. Still, we are not done. All cost
functions we know use some kind of statistics on the argument relation(s) in
order to compute the costs of some operation. Since we want to be generic with
respect to the cost function, we encapsulate the computation of statistics and
costs within functions S0 , C0 , S1 , and C1 . The function S0 retrieves statistics
for base relations. The function C0 computes the costs of retrieving (part of) a
base relation. Both functions take a set of applicable predicates as an additional
128 CHAPTER 3. JOIN ORDERING

construct-bushy-tree(R, P)

01 n = |R|
02 for i = 1 to n
03 B =applicable-predicates(Ri , P)
04 P =P \B
05 p[i, i] = B
06 s[i, i] = S0 (Ri , B)
07 c[i, i] = C0 (Ri , B)
08 for l = 2 to n
09 for i = 1 to n − l + 1
10 j =i+l−1
11 B = applicable-predicates(Ri...j , P)
12 P =P \B
13 p[i, j] = B
14 s[i, j] = S1 (s[i, j − 1], s[j, j], B)
15 c[i, j] = ∞
16 for k = i to j − 1
17 q = c[i, k] + c[k + 1, j] + C1 (s[i, k], s[k + 1, j], B)
18 IF (q < c[i,j])
19 c[i, j] = q
20 t[i, j] = k

Figure 3.26: Subroutine construct-bushy-tree

extract-plan(R, t, p)

01 return extract-subplan(R, t, p, 1, |R|)

extract-subplan(R, t, p, i, j)

01 IF (j > i)
02 X = extract-subplan(R, t, p, i, t[i, j])
03 Y = extract-subplan(R, t, p, t[i, j] + 1, j)
04 return X B̂p[i,j] Y
05 else
06 return σ̂p[i,i] (Ri )

Figure 3.27: Subroutine extract-plan and its subroutine

argument. The function S1 computes the statistics for intermediate relations.

Since the result of joining some relations Ri , . . . , Rj may occur in many different
plans, we compute it only once and store it in the array s. C1 computes the
costs of joining two relations and applying a set of predicates. Below, we show
how concrete (simple) cost and statistics functions can look like.
Given the above, the algorithm (see Fig. 3.26) fills the arrays in a bottom-up
manner by first computing for every base relation the applicable predicates, the
3.5. ORDERING ORDER-PRESERVING JOINS 129

statistics of the result of applying the predicates to the base relation and the
costs for computing these intermediate results, i.e. for retrieving the relevant
part of the base relation and applying the predicates (lines 02-07). Note that
this is not really trivial if there are several index structures that can be applied.
Then computing C0 involves considering different access paths. Since this is an
issue orthogonal to join ordering, we do not detail on it.
After we have the costs and statistics for sequences of length one, we com-
pute the same information for sequences of length two, three, and so on until
n (loop starting at line 08). For every length, we iterate over all subsequences
of that length (loop starting at line 09). We compute the applicable predicates
and the statistics. In order to determine the minimal costs, we have to consider
every possible split point. This is done by iterating the split point k from i to
j − 1 (line 16). For every k, we compute the cost and remember the k that
resulted in the lowest costs (lines 17-20).
The last subroutine takes the relations, the split points (t), and the applica-
ble predicates (p) as its input and extracts the plan. The whole plan is extracted
by calling extract-plan. This is done by instructing extract-subplan to re-
trieve the plan for all relations. This subroutine first determines whether the
plan for a base relation or that of an intermediate result is to be constructed.
In both cases, we did a little cheating here to keep things simple. The plan we
construct for base relations does not take the above-mentioned index structures
into account but simply applies a selection to a base relation instead. Obvi-
ously, this can easily be corrected. We also give the join operator the whole
set of predicates that can be applied. That is, we do not distinguish between
join predicates and other predicates that are better suited for a selection sub-
sequently applied to a join. Again, this can easily be corrected.
Let us have a quick look at the complexity of the algorithm. Given n rela-
tions with m attributes in total and p predicates, we can implement applicable-predicates
in O(pm) by using a bit vector representation for attributes and free variables
and computing the attributes for each sequence Ri , . . . , Rj once upfront. The
latter takes O(n2 m).
The complexity of the routine construct-bushy-tree is determined by the
three nested loops. We assume that S1 and C1 can be computed in O(p), which
is quite reasonable. Then, we have O(n3 p) for the innermost loop, O(n2 ) calls to
applicable-predicates, which amounts to O(n2 pm), and O(n2 p) for calls of
S1 . Extracting the plan is linear in n. Hence, the total runtime of the algorithm
is O(n2 (n + m)p)
In order to illustrate the algorithm, we need to fix the functions S0 , S1 , C0
and C1 . We use the simple cost function Cout . As a consequence, the array s
simply stores cardinalities, and S0 has to extract the cardinality of a given base
relation and multiply it by the selectivities of the applicable predicates. S1 mul-
tiplies the input cardinalities with the selectivities of the applicable predicates.
We set C0 to zero and C1 to S1 . The former is justified by the fact that every
relation must be accessed exactly once and hence, the access costs are equal in
130 CHAPTER 3. JOIN ORDERING

all plans. Summarizing, we define

Y
S0 (R, B) := |R| f (p)
p∈B
Y
S1 (x, y, B) := xy f (p)
p∈B
C0 (R, B) := 0
C1 (x, y, B) := S1 (x, y, B)

where B is a set of applicable predicates and for a single predicate p, f (p)

returns its selectivity.
We illustrate the algorithm by an example consisting of four relations R1 , . . . , R4
with cardinalities |R1 | = 200, |R2 | = 1, |R3 | = 1, |R4 | = 20. Besides, we have
three predicates pi,j with F(pi,j ) ⊆ A(Ri ) ∪ A(Rj ). They are p1,2 , p3,4 , and p1,4
with selectivities 1/2, 1/10, 1/5.
Let us first consider an example plan and its costs. The plan

((R1 B̂p1,2 R2 )B̂true R3 )B̂p1,4 ∧p3,4 R4

has the costs 240 = 100 + 100 + 40.

For our simple cost function, the algorithm construct-bushy-tree will fill
the array s with the initial values:

s
200
1
1
20

After initilization, the array c has 0 everywhere in its diagonal and the array p
empty sets.
For l = 2, the algorithm produces the following values:

l i j k s[i,j] q current c[i,j] current t[i,j]

2 1 2 1 100 100 100 1
2 2 3 2 1 1 1 2
2 3 4 3 2 2 2 3

For l = 3, the algorithm produces the following values:

l i j k s[i,j] q current c[i,j] current t[i,j]

3 1 3 1 200 101 101 1
3 1 3 2 200 200 101 1
3 2 4 2 2 4 4 2
3 2 4 3 2 3 3 3

For l = 4, the algorithm produces the following values:

3.6. CHARACTERIZING SEARCH SPACES 131

l i j k s[1,4] q current c[1,4] current t[1,4]

4 1 4 1 40 43 43 1
4 1 4 2 40 142 43 1
4 1 4 3 40 141 43 1

where for each k the value of q (in the following table denoted by qk ) is deter-
mined as follows:

q1 = c[1, 1] + c[2, 4] + 40 = 0 + 3 + 40 = 43
q2 = c[1, 2] + c[3, 4] + 40 = 100 + 2 + 40 = 142
q3 = c[1, 3] + c[4, 4] + 40 = 101 + 0 + 40 = 141

Collecting all the above t[i, j] values leaves us with the following array as
input for extract-plan:

i\j 1 2 3 4
1 1 1 1
2 2 3
3 3
4

The function extract-plan merely calls extract-subplan. For the latter,

we give the call hierarchy and the result produced:

000 extract-plan(. . ., 1, 4)
100 extract-plan(. . ., 1, 1)
200 extract-plan(. . ., 2, 4)
210 extract-plan(. . ., 2, 3)
211 extract-plan(. . ., 2, 2)
212 extract-plan(. . ., 3, 3)
210 return (R2 B̂true R3 )
220 extract-plan(. . ., 4, 4)
200 return ((R2 B̂true R3 )B̂p3,4 R4 )
000 return (R1 B̂p1,2 ∧p1,4 ((R2 B̂true R3 )B̂p3,4 R4 ))

The total cost of this plan is c[1, 4] = 43.

3.6 Characterizing Search Spaces

3.6.1 Complexity Thresholds
The complexity results presented in Section 3.1.6 show that most classes of join
ordering problems are NP-hard. However, it is quite clear that some instances
of the join ordering problem are simpler than others. For example, consider a
query graph which is a clique in n relations R1 , . . . , Rn . Further assume that
each Ri has cardinality 2i and all join selectivities are 1/2 (i.e. fi,j = 1/2 for
all 1 ≤ i, j ≤ n, i 6= j). Obviously, this problem is easy to optimize although
the query graph is clique. In this section we present some ideas on how the
132 CHAPTER 3. JOIN ORDERING

complexity of instances of the join ordering problem is influenced by certain

parameters.
How can we judge the complexity of a single instance of a join ordering prob-
lem? Using standard complexity theory, for single problem instances we easily
derive an algorithm that works in Θ(1). Hence, we must define other complexity
measures. Consider our introductory join ordering problem. A simple greedy
algorithm that orders relations according to their cardinality produces an op-
timal solution for it. Hence, one possibility to define the problem complexity
would be how far a solution produced by typical heuristics for join ordering
differ from the optimal solution. Another possibility is to use randomized al-
gorithms like iterative improvement of simulated annealing and see how far the
plans generated by them deviate from the optimal plan. These approaches have
the problem that the results may depend on the chosen algorithm. This can
be avoided by using the following approach. For each join ordering problem
instance, we compute the fraction of good plans compared to all plans. There-
for, we need a measure of “good”. Typical examples thereof would be to say a
plan is “good” if it does not deviate more than 10% or a factor of two from the
optimal plan.
If these investigations were readily available, there are certain obvious ben-
efits [503]:
1. The designer of an optimizer can classify queries such that heuristics are
applied where they guarantee success; cases where they are bound to fail
can be avoided. Furthermore, taking into account the vastly different run
time of the different join ordering heuristics and probabilistic optimization
procedures, the designer of an optimizer can choose the method that
achieves a satisfactory result with the least effort.
2. The developer of search procedures and heuristics can use this knowl-
edge to design methods solving hard problems (as exemplified for graph
coloring problems [424]).
3. The investigator of different join ordering techniques is able to (1) con-
sciously design challenging benchmarks and (2) evaluate existing bench-
marks according to their degree of challenge.
The kind of investigation presented in this section first started in the context
of artificial intelligence where a paper by Cheeseman, Kanefsky, and Taylor
[159] spurred a whole new branch of research where the measures to judge
the complexity of problem instances was investigated for many different NP-
complete problems like satisfiability [159, 208, 322, 598], graph coloring [159],
Hamiltonian circuits [159], traveling salesman [159], and constraint satisfaction
[915].
We only present a small fraction of all possible investigations. The restric-
tions are that we do not consider all parameters that possibly influence the
problem complexity, we only consider left-deep trees, and we restrict ourselves
to the cost function Chj . The join graphs are randomly generated. Starting
with a circle, we randomly added edges until a clique is reached. The read-
er is advised to carry out his or her own experiments. Therefor, the following
3.6. CHARACTERIZING SEARCH SPACES 133

pointer into the literature might be useful. Lanzelotte and Valduriez provide an
object-oriented design for search strategies [522]. This allows easy modification
and even the exchange of the plan generator’s search strategy.

Search Space Analysis

The goal of this section is to determine the influence of the parameters on the
search space of left-deep join trees. More specifically, we are interested in how a
variation of the parameters changes the percentage of good solutions among all
solutions. The quality of a solution is measured by the factor its cost deviates
from the optimal permutation. For this, all permutations have to be gener-
ated and evaluated. The results of this experiment are shown in Figures 3.28
and 3.29. Each single curve accumulates the percentage of all permutations
deviating less than a certain factor (given as the label) from the optimum. The
accumulated percentages are given at the y-axes, the connectivity at the x-axes.
The connectivity is given by the number of edges in the join graph. The curves
within the figures are organized as follows. Figure 3.28 (3.29) shows varying
mean selectivity values (relation sizes) and variances where the mean selectivity
values (relation sizes) increase from top to bottom and the variances increase
from left to right.
Note that the more curves are visible and the lower their y-values, the harder
is the problem. We observe the following:

• all curves exhibit a minimum value at a certain connectivity

• which moves with increasing mean values to the right;
• increasing variances does not have an impact on the minimum connectiv-
ity,
• problems become less difficult with increasing mean values.

These findings can be explained as follows. With increasing connectivity,

the join ordering problem becomes more complex up to a certain point and
then less complex again. To see this, consider the following special though
illustrative case. Assume an almost equal distribution of the costs of all al-
ternatives between the worst case and optimal costs, equal relation sizes, and
equal selectivities. Then the optimization potential worst case/optimum is 1
for connectivity 0 and cliques. In between, there exists a connectivity exhibit-
ing the maximum optimization potential. This connectivity corresponds to the
minimum connectivity of Figures 3.28 and 3.29.
There is another factor which influences the complexity of a single problem
instance. Consider joining n relations. The problem becomes less complex
if after joining i < n relations the intermediate result becomes so small that
the accumulated costs of the subsequent n − i joins are small compared to the
costs of joining the first i relations. Hence, the ordering of the remaining n − i
relations does not have a big influence on the total costs. This is the case
for very small relations, small selectivities, or high connectivities. The greater
selectivities and relation sizes are, the more relations have to be joined to reach
this critical size of the intermediate result. If the connectivity is enlarged, this
134 CHAPTER 3. JOIN ORDERING

critical size is reached earlier. Since the number of selectivities involved in the
first few joins is small regardless of the connectivity, there is a lower limit to the
number of joined relations required to arrive at the critical intermediate result
size. If the connectivity is larger, this point is reached earlier, but there exists
a lower limit on the connectivity where this point is reached. The reason for
this lower limit is that the number of selectivities involved in the joins remains
small for the first couple of relations, independent of their connectivity. These
lines of argument explain subsequent findings, too.
The reader should be aware of the fact that the number of relations joined is
quite small (10) in our experiments. Further, as observed by several researchers,
if the number of joins increases, the number of “good” plans decreases [298, 845].
That is, increasing the number of relations makes the join ordering problem
more difficult.

Figure 3.28: Impact of selectivity on the search space

Figure 3.29: Impact of relation sizes on the search space

Heuristics
For analyzing the influence of the parameters on the performance of heuristics,
we give the figures for four different heuristics. The first two are very simple.
The minSel heuristic selects those relations first of which incident join edges
exhibit the minimal selectivity. The recMinRel heuristic chooses those relations
first which result in the smallest intermediate relation.
We also analyzed the two advanced heuristics IKKBZ and RDC . The IKKBZ
heuristic [512] is based on an optimal join ordering procedure [432, 512] which
is applied to the minimal spanning tree of the join graph where the edges are
labeled by the selectivities. The family of RDC heuristics is based on the rela-
tional difference calculus as developed in [412]. Since our goal is not to bench-
mark different heuristics in order to determine the best one, we have chosen
the simplest variant of the family of RDC based heuristics. Here, the relations
are ordered according to a certain weight whose actual computation is—for
the purpose of this section—of no interest. The results of the experiments are
presented in Figure 3.30.
On a first glance, these figures look less regular than those presented so far.
This might be due to the non-stable behavior of the heuristics. Nevertheless,
we can extract the following observations. Many curves exhibit a peak at a
certain connectivity. Here, the heuristics perform worst. The peak connectivity
is dependent on the selectivity size but not as regular as in the previous curves.
Further, higher selectivities flatten the curves, that is, heuristics perform better
at higher selectivities.

Figure 3.30: Impact of parameters on the performance of heuristics

3.7. DISCUSSION 135

Probabilistic Optimization Procedures

Figure 3.31 shows four pictures corresponding to simulated annealing (SA),
iterative improvement (II), iterative improvement applied to the outcome of
the IKKBZ heuristic (IKKBZ/II) and the RDC heuristic (RDC/II) [412]. The
patterns shown in Figure 3.31 are very regular. All curves exhibit a peak
at a certain connectivity. The peak connectivities typically coincide with the
minimum connectivity of the search space analysis. Higher selectivities result
in flatter curves; the probabilistic procedures perform better. These findings
are absolutely coherent with the search space analysis. This is not surprising,
since the probabilistic procedures investigate systematically —although with
some random influence— a certain part of the search space.
Given a join ordering problem, we can describe its potential search space as
a graph. The set of nodes consists of the set of join trees. For every two join
trees a and b, we add an edge (a, b) if b can be reached from a by one of the
transformation rules used in the probabilistic procedure. Further, with every
node we can associate the cost its corresponding join tree.
Having in mind that the probabilistic algorithms are always in danger of
being stuck in a local minima, the following two properties of the search space
are of interest:

1. the cost distribution of local minima, and

2. the connection cost of low local minima.

Of course, if all local minima are of about the same cost, we do not have to
worry, otherwise we do. It would be very interesting to know the percentage of
local minima that are close to the global minima.
Concerning the second property, we first have to define the connection cost.
Let a and b be two nodes and P be the set of all paths from a to b. The
connection cost of a and b is then defined as minp∈P maxs∈p {cost(s)|s 6= a, s 6=
b}. Now, if the connection costs are high, we know that if we have to travel
from one local minima to another, there is at least one node we have to pass
which has high costs. Obviously, this is bad for our probabilistic procedures.
Ioannidis and Kang [446] call a search graph that is favorable with respect to
the two properties a well . Unfortunately, investigating these two properties
of real search spaces is rather difficult. However, Ioannidis and Kang, later
supported by Zhang, succeeded in characterizing cost wells in random graphs
[446, 447]. They also conclude that the search space comprising bushy trees is
better w.r.t. our two properties than the one for left-deep trees.

Figure 3.31: Impact of selectivities on probabilistic procedures

3.7 Discussion
Choose one of dynamic programming, memoization, permutations as the core
of your plan generation algorithm and extend it with the rest of book. ToDo
136 CHAPTER 3. JOIN ORDERING

3.8 Bibliography
ToDo: Oezsu, Meechan [650, 651]
Chapter 4

Database Items, Building

Blocks, and Access Paths

In this chapter we go down to the storage layer and discuss leaf nodes of query
execution plans and plan fragments. We briefly recap some notions, but reading
a book on database implementation might be helpful [397, 312]. Although
alternative storage technologies exist and are being developed [752], databases
are mostly stored on disks. Thus, we start out by introducing a simple disk
model to capture I/O costs. Then, we say some words about database buffers,
physical data organization, slotted pages and tuple identifiers (TIDs), physical
record layout, physical algebra, and the iterator concept. These are the basic
notions in order to start with the main purpose of this section: giving an
overview over the possibilities available to structure the low level parts of a
physical query evaluation plan. In order to calculate the I/O costs of these plan
fragments, a more sophisticated cost model for several kinds of disk accesses is
introduced.

4.1 Disk Drive

Figure 4.1 shows a top and a side view of a typical disk. A disk consists of
several platters that rotate around the spindle at a fixed speed. The platters
are coated with a magnetic material on at least one of their surfaces. All coated
sides are organized into the same pattern of concentric circles. One concentric
circle is called a track. All the tracks residing exactly underneath and above
each other form a cylinder. We assume that there is only one read/write head
for every coated surface.1 All tracks of a cylinder can be accessed with only
minor adjustments at the same time by their respective heads. By moving
the arm around the arm pivot, other cylinders can be accessed. Each track is
partitioned into sectors. Sectors have a disk specific (almost) fixed capacity of
512 B. The read and write granularity is a sector. Read and write accesses take
place while the sector passes under the head.
The top view of Figure 4.1 shows that the outer sectors are longer than the
1
This assumption is valid for most but not all disks.

137
138CHAPTER 4. DATABASE ITEMS, BUILDING BLOCKS, AND ACCESS PATHS

arm
arm head spindle
assembly sector track

platter
head
arm
arm
pivot

cylinder

a. side view b. top view

Figure 4.1: Disk drive assembly

inner sectors. The highest density (e.g. in bits per centimeter) at which bits
can be separated is fixed for a given disk. For storing 512 B, this results in a
minimum sector length which is used for the tracks of the innermost cylinder.
Thus, since sectors on outer tracks are longer, storage capacity is wasted there.
To overcome this problem, disks have a varying number of sectors per track.
(This is where the picture lies.) Therefore, the cylinders are organized into
zones. Every zone contains a fixed number of consecutive cylinders, each having
a fixed number of sectors per track. Between zones, the number of sectors per
track varies. Outer zones have more sectors per track than inner zones. Since
the platters rotate with a fixed angular speed, sectors of outer cylinders can be
read faster than sectors of inner cylinders. As a consequence, the throughput
for reading and writing outer cylinders is higher than for inner cylinders.
Assume that we sequentially read all the sectors of all tracks of some con-
secutive cylinders. After reading all sectors of some track, we must proceed to
the next track. If it is contained in the same cylinder, then we must (simply)
use another head: a head switch occurs. Due to calibration, this takes some
time. Thus, if all sectors start at the same angular position, we come too late
to read the first sector of the next track and have to wait. To avoid this, the
angular start positions of the sectors of tracks in the same cylinder are skewed
such that this track skew compensates for the head switch time. If the next
track is contained in another cylinder, the heads have to switch to the next
cylinder. Again, this takes time and we miss the first sector if all sectors of a
surface start at the same angular positions. Cylinder skew is used such that
the time needed for this switch does not make us miss to start reading the next
sector. In general, skewing works in only one direction.
A sector can be addressed by a triple containing its cylinder, head (surface),
and sector number. This triple is called the physical address of a sector. How-
ever, disks are accessed using logical addresses. These are called logical block
numbers (LBN) and are consecutive numbers starting with zero. The disk in-
ternally maps LBNs to physical addresses. This mapping is captured in the
following table:
4.1. DISK DRIVE 139

Host sends Controller Data transfer to host Status message to host

command decodes it

SCSI bus

Disk 1

Disk 2

Seek Rotational
Disk 3
latency
Data transfer off mechanism Time

Read service time for disk 1

Read service time for disk 2

Figure 4.2: Disk drive read request processing

cylinder track LBN number of sectors per track

0 0 0 573
1 573 573
... ... ... ...
5 2865 573
1 0 3438 573
... ... ... ...
15041 0 35841845 253
... ... ... ...

However, this ideal view is disturbed by the phenomenon of bad blocks. A

bad block is one with a defect and it cannot be read or written. After a block
with a certain LBN is detected to be bad, it is assigned to another sector. The
above mapping changes. In order to be able redirect LBNs, extra space on the
disk must exist. Hence, some cylinders, tracks, and sectors are reserved for this
purpose. They may be scattered all over the platters. Redirected blocks cause
hiccups during sequential read.
Building (see e.g. [633]) and modeling (see e.g. [576, 737, 800, 801, 867, 913])
disk drives is challenging. Whereas the former is not really important when
building query compiler, the latter is, as we have to attach costs to query eval-
uation plans. These costs reflect the amount of time we occupy the resource
disk. Since disks are relatively slow, they may become the bottleneck of a
database server. Modeling and minimizing disk access (time) is thus an im-
portant topic. Consider the case where we want to read a block from a SCSI
disk. Simplified, the following actions take place and take their time (see also
Fig. 4.2):

1. The host sends the SCSI command.

2. The disk controller decodes the command and calculates the physical
address.
140CHAPTER 4. DATABASE ITEMS, BUILDING BLOCKS, AND ACCESS PATHS

3. During the seek the disk drive’s arm is positioned such that the accord-
ing head is correctly placed over the cylinder where the requested block
resides. This step consists of several phases.

(a) The disk controller accelerates the arm.

(b) For long seeks, the arm moves with maximum velocity (coast).
(c) The disk controller slows down the arm.
(d) The disk arm settles for the desired location. The settle times differ
for read and write requests. For reads, an aggressive strategy is used.
If, after all, it turns out that the block could not be read correctly,
we can just discard it. For writing, a more conservative strategy is
in order.

4. The disk has to wait until the sector where the requested block resides
comes under the head (rotation latency).

5. The disk reads the sector and transfers data to the host.

6. Finally, it sends a status message.

Note that the transfers for different read requests are interleaved. This is pos-
sible since the capacity of the SCSI bus is higher than the read throughput of
the disk. Also note that we did not mention the operating system delay and
congestions on the SCSI bus.
Disk drives apply several strategies to accelerate the above-mentioned round-
trip time and access patterns like sequential read. Among them are caching,
ToDo read-ahead, and command queuing. (discuss interleaving?)
The seek and rotation latency times highly depend on the head’s position
on the platter surface. Let us consider seek time. A good approximation of the
seek time where d cylinders have to be travelled is given by
√
c1 + c2 d d <= c0
seektime(d) =
c3 + c4 d d > c0
where the constants ci are disk-specific. The constant c0 indicates the maximum
number of cylinders where no coast takes place: seeking over a distance of more
than c0 cylinders results in a phase where the disk arm moves with maximum
velocity.
For disk accesses, the database system must be able to estimate the time
they take to be executed. First of all, we need the parameters of the disk. It
is not too easy to get hold of them, but we can make use of several tools to
extract them from a given disk [239, 307, 853, 759, 925, 926]. However, then we
have a big problem: when calculating I/O costs, the query compiler has no idea
where the head will be when the query evaluation plan emits a certain read (or
write) command. Thus, we have to find another solution. In the following, we
will discuss a rather simplistic cost model that will serve us to get a feeling for
disk behavior. Later, we develop a more realistic model (Section 4.17).
The solution is rather trivial: we sum up all command sending and inter-
preting times as well the times for positioning (seek and rotation latency) which
4.1. DISK DRIVE 141

form by far the major part. Let us call the result latency time. Then, we assume
an average latency time. This, of course, may result in large errors for a single
request. However, on average, the error can be as “low” as 35% [737]. The next
parameter is the sustained read rate. The disk is assumed to be able to deliver
a certain amount of bytes per second while reading data stored consecutively.
Of course, considering multi-zone disks, we know that this is oversimplified,
but we are still in our simplistic model. Analogously, we have a sustained write
rate. For simplicity, we will assume that this is the same as the sustained read
rate. Last, the capacity is of some interest. A hypothetical disk (inspired by
disks available in 2004) then has the following parameters:

Model 2004
Parameter Value Abbreviated Name
capacity 180 GB Dcap
average latency time 5 ms Dlat
sustained read rate 100 MB/s Dsrr
sustained write rate 100 MB/s Dswr

The time a disk needs to read and transfer n bytes is then approximated by
Dlat + n/Dsrr . Again, this is overly simplistic: (1) due to head switches and
cylinder switches, long reads have lower throughput than short reads and (2)
multiple zones are not modelled correctly. However, let us use this very sim-
plistic model to get some feeling for disk costs.
Database management system developers distinguish between sequential
I/O and random I/O. For sequential I/O, there is only one positioning at the
beginning and then, we can assume that data is read with the sustained read
rate. For random I/O, one positioning for every unit of transfer—typically a
page of say 8 KB—is assumed. Let us illustrate the effect of positioning by a
small example. Assume that we want to read 100 MB of data stored consecu-
tively on a disk. Sequential read takes 5 ms plus 1 s. If we read in blocks of
8 KB where each block requires positioning then reading 100 MB takes 65 s.
Assume that we have a relation of about 100 MB in size, stored on a disk,
and we want to read it. Does it take 1 s or 65 s? If the blocks on which it is
stored are randomly scattered on disk and we access them in a random order,
65 s is a good approximation. So let us assume that it is stored on consecutive
blocks. Assume that we read in chunks of 8 KB. Then,

• other applications,

• other transactions, and

• other read operations of the same query evaluation plan

could move the head away from our reading position. (Congestion on the SCSI
bus may also be problem.) Again, we could be left with 65 s. Reading the
whole relation with one read request is a possibility but may pose problems
to the buffer manager. Fortunately, we can read in chunks much smaller than
100 MB. Consider Figure 4.3. If we read in chunks of 100 8 KB blocks we are
already pretty close to one second (within a factor of two).
142CHAPTER 4. DATABASE ITEMS, BUILDING BLOCKS, AND ACCESS PATHS

1
1 4 16 64

Figure 4.3: Time to read 100 MB from disk (depending on the number of 8 KB
blocks read at once)
4.1. DISK DRIVE 143

Note that the interleaving of actions does not necessarily mean a negative
impact. This depends on the point of view, i.e. what we want to optimize. If we
want to optimize response time for a single query, then obviously the impact of
concurrent actions is negative. If, however, we want to optimize resource (here:
disk) usage, concurrent actions might help. ToDo?
There are two important things to learn here. First, sequential read is much
faster than random read. Second, the runtime system should secure sequential
read. The latter point can be generalized: the runtime system of a database
management system has, as far as query execution is concerned, two equally
important tasks:

• allow for efficient query evaluation plans and

• allow for smooth, simple, and robust cost functions.

Typical measures on the database side are

• carefully chosen physical layout on disk

(e.g. cylinder or track-aligned extents [760, 761, 758], clustering),

• disk scheduling, multi-page requests

[224, 451, 769, 770, 777, 796, 827, 917, 924],

• (asynchronous) prefetching,

• piggy-back scans,

• buffering (e.g. multiple buffers, replacement strategy from [70] to [589]),

and last but not least

• efficient and robust algorithms for algebraic operators [341].

Let us take yet another look at it. 100 MB can be stored on 12800 8 KB
pages. Figure 4.4 shows the time to read n random pages. In our simplistic cost
model, reading 200 pages randomly costs about the same as reading 100 MB
sequentially. That is, reading 1/64th of 100 MB randomly takes as long as
reading the 100 MB sequentially. Let us denote by a the positioning time, s
the sustained read rate, p the page size, and d some amount of consecutively
stored bytes. Let us calculate the break-even point

n ∗ (a + p/s) = a + d/s
n = (a + d/s)/(a + p/s)
= (as + d)/(as + p)

a and s are disk parameters and, hence, fixed. For a fixed d, the break-even
point depends on the page size. This is illustrated in Figure 4.5. The x-axis is
the page size p in multiples of 1 K and the y-axis is (d/p)/n for d = 100 MB.
For sequential reads, the page size does not matter. (Be aware that our
simplistic model heavily underestimates sequential reads.) For random reads,
144CHAPTER 4. DATABASE ITEMS, BUILDING BLOCKS, AND ACCESS PATHS

2.5

1.5

0.5

0
0 100 200 30

Figure 4.4: Time needed to read n random pages

4.1. DISK DRIVE 145

512

256

128

8
1 2 4 8

Figure 4.5: Break-even point in fraction of total pages depending on page size
146CHAPTER 4. DATABASE ITEMS, BUILDING BLOCKS, AND ACCESS PATHS

as long as a single page is read, it matters neither: reading a single page of 1 KB

lasts 5.0097656 ms, for an 8 KB page the number is 5.0781250 ms. From all
this, we could draw the conclusion that the larger the page the better. However,
this is only true for the disk, not, e.g., for the buffer or the SCSI bus. If we need
to access only 500 B of a page, then the larger the page the higher the fraction
that is wasted. This is not as severe as it sounds. Other queries or transactions
might need other parts of the page during a single stay in the buffer. Let us
call the fraction of the page that is read by some transaction during a stay in
the buffer by utilization. Obviously, the higher the utilization the better is our
usage of the main memory in which the buffer resides. For smaller pages, the
utilization is typically higher than for larger pages. The frequency by which
pages are used is another factor. [361, 362].

Excursion. Consider the root page of a B-tree. It is accessed quite frequently

and most of its parts will be used, no matter how large it is. Hence, utilization
is always good. Thus, the larger the root page of a B-tree the better. On the
other hand, consider a leaf page of a B-tree that is much bigger than main
memory. During a single stay of it, only a small fraction of the page will be
used. That is, smaller leaf pages are typically better. By converting everything
to money instead of time, Gray and Graefe [361] as well as Lomet [558] come
to the conclusion that a page size between 8 and 16 KB was a good choice at
the end of the last century.
For the less simplistic model of disk access costs developed in Section 4.17,
we need to describe a disk drive by a set of parameters. These parameters are
summarized in Table 4.1.
Let us close this section by giving upper bounds on seek time and rotational
latency. Qyang proved the following theorem which gives a tight upper bound
of disk seek time if several cylinders of a consecutive range of cylinders have to
be visited [692].

Theorem 4.1.1 (Qyang) If the disk arm has to travel over a region of C
cylinders, it is positioned on the first of the C cylinders and has to stop at s − 1
of them, then sDseek (C/s) is an upper bound for the seek time.

The time required for s consecutive sectors in a track of zone i to pass by

the head is
Drot
Drot (s, i) = sDZscan (i) = s (4.1)
DZspt (i)
A trivial upper bound for the rotational delay is a full rotation.

4.2 Database Buffer

The database buffer
1. is a finite piece of memory,
2. typically supports a limited number of different page sizes (mostly one or
two),
4.3. PHYSICAL DATABASE ORGANIZATION 147

Dcyl total number of cylinders

Dtrack total number of tracks
Dsector total number of sectors
Dtpc number of tracks per cylinder (= number of surfaces)

Dcmd command interpretation time

Drot time for a full rotation
Drdsettle time for settle for read
Dwrsettle time for settle for write
Dhdswitch time for head switch

DZone total number of zones

DZcyl (i) number of cylinders in zone i
DZspt (i) number of sectors per track in zone i
DZspc (i) number of sectors per cylinder in zone i (= Dtpc DZspt (i))
DZscan (i) time to scan a sector in zone i (= Drot /DZspt (i))

Davgseek average seek costs

D c0 parameter for seek cost function
D c1 parameter for seek cost function
D c2 parameter for seek cost function
D c3 parameter for seek cost function
D c4 parameter for seek cost function

Dseek (d) cost of a seek

of d cylinders
√
D c1 + D c2 d if d ≤ Dc0
Dseek (d) =
D c3 + D c4 d if d > Dc0
Drot (s, i) rotation cost for s sectors of zone i (= sDZscan (i))

Table 4.1: Disk drive parameters and elementary cost functions

3. is often fragmented into several buffer pools,

4. each having a replacement strategy (typically enhanced by hints).

Given the page identifier, the buffer frame is found by a hashtable lookup.
Accesses to the hash table and the buffer frame need to be synchronized. Before
accessing a page in the buffer, it must be fixed. These points account for the
fact that the costs of accessing a page in the buffer are, therefore, greater than
zero.

4.3 Physical Database Organization

We call everything that is stored in the database and relevant for answering
queries a database item. Let us exclude meta data. In a relational system,
a database item can be a relation, a fragment of a relation (if the relation is
horizontally or vertically fragmented), a segment, an index, a materialized view,
148CHAPTER 4. DATABASE ITEMS, BUILDING BLOCKS, AND ACCESS PATHS

or an index on a materialized view. In object-oriented databases, a database

item can be the extent of a class, a named object, an index and so forth. In
XML databases, a database item can be a named document, a collection of
documents, or an index. Access operations to database items form the leaves
of query evaluation plans.

Partition Relation
1 1

contains fragmented

N N
Segment N mapped M Fragment

1
N

consists of contains

N
Page

stores

N M
N represented 1
Record Tuple

Figure 4.6: Physical organization of a relational database

The physical algebra implemented in the query execution engine of some

runtime systems allow to access database items. Since most database items
consist of several data items (tuples, objects, documents), these access oper-
ations produce a stream of data items. This kind of collection-valued access
operation is called a scan. Consider the simple query
select *
from Student

This query is valid only if the database item (relation) Student exists. It could
4.3. PHYSICAL DATABASE ORGANIZATION 149

be accessible via a relation scan operation rscan(Student). However, in

reality we have to consider the physical organization of the database.
Figure 4.6 gives an overview of how relations can be stored in a relational
database system. Physical database items can be found on the left-hand side,
logical database items on the right-hand side. A fraction of a physical disk is
a partition. It can be an operating system file or a raw partition. A partition
is organized into several segments. A segment consists of several pages. The
pages within a segment are typically accessible by a non-negative integer in
[0, n[, where n is the number of pages of the segment2 . Iterative access to all
pages of a segment is typically possible. The access is called a scan. As there
are several types of segments (e.g. data segments, index segments), several kinds
of scans exist. Within a page, physical records are stored. Each physical record
represents a (part of a) tuple of a fragment of a relation.
Fragments are mapped to segments and relations are partitioned into frag-
ments. In the simplest and most common organization, every relation has only
one fragment with a one-to-one mapping to segments, and for every tuple there
exists exactly one record representing only this tuple. Hence, both of relation-
ships mapped and represented are one-to-one. However, this organization does
not scale well. A relation could be larger than a disk. Even if a large relation,
say 180 GB fits on a disk, scanning it takes half an hour (Model 2004). Hori-
zontal partitioning and allocation of the fragments on several disks reduces the
scan time by allowing for parallelism. Vertical partitioning is another means of
reducing I/O [202]. Here, a tuple is represented by several physical records, each
one containing a subset of the tuple’s attributes. Since the relationship mapped
is N:M, tuples from different relations can be stored in the same segment. Fur-
thermore, in distributed database systems some fragments might be stored re-
dundantly at different locations to improve access times [134, 506, 693, 652].
Some systems support clustering of tuples of different relations. For exam-
ple, department tuples can be clustered with employee tuples such that those
employees belonging to the department are close together and close to their
department tuple. Such an organization speeds up join processing.
To estimate costs, we need a model of a segment. We assume an extent-
based implementation. That is, a segment consists of several extents3 . Each
extent occupies consecutive sectors on disk. For simplicity, we assume that
whole cylinders belong to a segment. Then, we can model segments as follows.
Each segment consists of a sequence of extents. Each extent is stored on con-
secutive cylinders. Cylinders are exclusively assigned to a segment. We then
describe each extent j as a pair (Fj , Lj ) where Fj is the first and Lj the last
cylinder of a consecutive sequence of cylinders. A segment can then be described
by a sequence of such pairs. We assume that these pairs are sorted in ascending
order. In such a description, an extent may include a zone boundary. Since cost
functions are dependent on the zone, we break up cylinder ranges that are not
contained in a single zone. The result can be described by a sequence of triples

2
This might not be true. Alternatively, the pages of a partition can be consecutively
numbered.
3
Extents are not shown in Fig. 4.6. They can be included between Partitions and Segments.
150CHAPTER 4. DATABASE ITEMS, BUILDING BLOCKS, AND ACCESS PATHS

273 2
273 827

827 1

Figure 4.7: Slotted pages and TIDs

(Fi , Li , zi ) where Fi and Li mark a range of consecutive cylinders in a zone zi .

Although the zi ’s can be inferred from the cylinder numbers, we include them
for clarity. Also of interest are the total number of sectors in a segment and the
number of cylinders Scpe (i) in an extent i. Summarizing, we describe a segment
by the parameter given in Table 4.2.

Sext number of extents in the segment

P ext
Ssec total number of sectors in the segment (= Si=1 Scpe (i)DZspc (Szone (i)))
Sfirst (i) first cylinder in extent i
Slast (i) last cylinder in extent i
Scpe (i) number of cylinders in extent i (= Slast (i) − Sfirst (i) + 1)
Szone (i) zone of extent i

Table 4.2: Segment parameters

4.4 Slotted Page and Tuple Identifier (TID)

Let us briefly review slotted pages and the concept of tuple identifiers (TIDs)
(see Figure 4.7) [41, 40, 559, 834]. Sometimes, record identifer or row identifier
(RID) is used in the literature. A TID consists of (at least) two parts. The
first part identifies a page, the second part a slot on a slotted page. The slot
contains—among other things, e.g. the record’s size—a (relative) pointer to the
actual record. This way, the record can be moved within the page without
invalidating its TID. When a record grows beyond the available space, it is
moved to another page and leaves a forward pointer (again consisting of a page
and a slot identifier) in its original position. This happened to the TID [273, 1]
in Figure 4.7. If the record has to be moved again, the forward pointer is
adjusted. This way, at most two page accesses are needed to retrieve a record,
given its TID. For evaluating the costs of record accesses, we will assume that
4.5. PHYSICAL RECORD LAYOUTS 151

273 2
273 827

827 1

Figure 4.8: Various physical record layouts

the fraction of moved records is known.

4.5 Physical Record Layouts

A physical record represents a tuple, object, or some other logical entity or
fraction thereof. In case it represents a tuple, it consists of several fields, each
representing the value of an attribute. These values can be integers, floating
point numbers, or strings. In case of object-oriented or object-relational sys-
tems, the values can also be of a complex type. Tuple identifiers are also possible
as attribute values [718]. This can, for example, speed up join processing.
In any case, we can distinguish between types whose values all exhibit the
same fixed length and those whose values may vary in length. In a physical
record, the values of fixed-length attributes are concatenated and the offset
from the beginning of the record to the value of some selected attribute can
be inferred from the types of the values preceding it. This differs for values
of varying length. Here, several encodings are possible. Some simple ones are
depicted in Figure 4.8. The topmost record encodes varying length values as a
sequence of pairs of the form [size, value]. This encoding has the disadvantage
that access to an attribute of varying length is linear in the number of those
preceding it. This disadvantage is avoided in the solution presented in the
middle. Instead of storing the sizes of the individual values, there is an array
containing relative offsets into the physical record. They point to the start of
the values. The length of the values can be inferred from these offsets and,
in case of the last value, from the total length of the physical record, which is
typically stored in its slot. Access to a value of varying size is now simplified to
an indirect memory access plus some length calculations. Although this might
152CHAPTER 4. DATABASE ITEMS, BUILDING BLOCKS, AND ACCESS PATHS

be cheaper than the first solution, there is still a non-negligible cost associated
with an attribute access.
The third physical record layout can be used to represent compressed at-
tribute values and even compressed length information for parts of varying size.
Note that if fixed size fields are compressed, their length becomes varying. Ac-
cess to an attribute now means decompressing length/offset information and
decompressing the value itself. The former is quite cheap: it boils down to an
indirect memory access with some offset taken from an array [908]. The cost
of the latter depends on the compression scheme used. It should be clear that
accessing an attribute value now is even more expensive. To make the costs of
an attribute access explicit was the sole purpose of this small section.
Remark Westmann et al. discuss an efficient implementation of compres-
sion and evaluate its performance [908]. Yiannis and Zobel report on experi-
ments with several compression techniques used to speed up the sort operator.
For some of them, the CPU usage is twice as large [946].

4.6 Physical Algebra (Iterator Concept)

Physical algebraic operators are mostly implemented as iterators. This means
that they support the the interface operations open, next, and close. With
open, the stream of items (e.g. tuples) is initialized. With next, the next item
on the stream is fetched. When no more items are available, e.g. next returns
false, close can be called to clean up things. The iterator concept is explained
in many text books (e.g. [312, 397, 476]) and the query processing survey by
Graefe [341]. This basic iterator concept has been extended to better cope
with nested evaluation by Westmann in his thesis [906], Westmann et al. [908],
and Graefe [345]. The two main issues are separation of storage allocation and
initialization, and batched processing. The former splits open into resource
allocation, initialization of the operator, and initialization of the iterator.

4.7 Simple Scan

Let us come back to the scan operations. A logical operation for scanning rela-
tions (which could be called rscan) is rarely supported by relational database
management systems. Instead, they provide (physical) scans on segments. Since
a (data) segment is sometimes called file, the correct plan for the above query
is often denoted by fscan(Student). Several assumptions must hold: the
Student relation is not fragmented, it is stored in a single segment, the name
of this segment is the same as the relation name, and no tuples from other
relations are stored in this segment. Until otherwise stated, we will assume
that relations are not partitioned, are stored in a single segment and that the
segment can be inferred from the relation’s name. Instead of fscan(Student),
we could then simply use Student to denote leaf nodes in a query execution
plan. If we want to use a variable that is bound subsequently to each tuple in
a relation, the query
select *
4.8. SCAN AND ATTRIBUTE ACCESS 153

from Student

can be expressed as Student[s] instead of Student. In this notation, the output

stream contains tuples having a single attribute s bound to a tuple. Physically,
s will not hold the whole tuple but, for example, a pointer into the buffer where
the tuple can be found. An alternative is a pointer to a slot of a slotted page
contained in the buffer.
A simple scan is an example for a building block . In general, a building
block is something that is used as a bottommost operator in a query evaluation
plan. Hence, every leaf node of a query evaluation plan is a building block or
a part thereof. This is not really a sharp definition, but is sometimes useful
to describe the behavior of a query compiler: after their determination, it will
leave building blocks untouched even if reorderings are hypothetically possible.
Although a building block can be more than a leaf node (scan) of a query
evaluation plan, it will never include more than a single database item. As
soon as more database items are involved, we use the notion of access path, a
term which will become more precise later on when we discuss index usage.
The disk access costs for a simple scan can be derived from the considera-
tions in Section 4.1 and Section 4.17.

4.8 Scan and Attribute Access

Strictly speaking, a plan like σage>30 (Student[s]) is invalid, since the tuple
stream produced by Student[s] contains tuples with a single attribute s. We
have a choice. Either we assume that attribute access takes place implicitly, or
we make it explicit. Whether this makes sense or not depends on the database
management system for which we generate plans. Let us discuss the advantages
of explicit attribute retrieval. Assume s.age retrieves the age of a student.
Then we can write σs.age>30 (Student[s]), where there is some non-neglectable
cost for s.age. The expression σs.age>30∧s.age<40 (Student[s]) executes s.age
twice. This is a bad idea. Instead, we would like to retrieve it once and reuse
it later.
This purpose is well-served by the map operator (χ). It adds new attributes
to a given tuple and is defined as

χa1 :e1 ,...,an :en (e) := {t ◦ [a1 : c1 , . . . , an : cn ]|t ∈ e, ci = ei (t) ∀ (1 ≤ i ≤ n)}

where ◦ denotes tuple concatenation and the ai must not be in A(e). (Remem-
ber that A(e) is the set of attributes produced by e.) Every input tuple t is
extended by new attributes ai , whose values are computed by evaluating the
expression ei , in which free variables (attributes) are bound to the attributes
(variables) provided by t.
The above problem can now be solved by

σage>30∧age<40 (χage:s.age (Student[s])).

154CHAPTER 4. DATABASE ITEMS, BUILDING BLOCKS, AND ACCESS PATHS

In general, it is beneficial to load attributes as late as possible. The latest point

at which all attributes must be read from the page is typically just before a
pipeline breaker4 .
To see why this is useful, consider the simple query

select name
from Student
where age > 30

The plan
Πn (χn:s.name (σa>30 (χa:s.age (Student[s]))))

makes use of this feature, while

Πn (σa>30 (χn:s.name,a:s.age (Student[s])))

does not. In the first plan the name attribute is only accessed for those students
with age over 30. Hence, it should be cheaper to evaluate. If the database
management system does not support this selective access mechanism, we often
find the scan enhanced by a list of attributes that is projected and included in
the resulting tuple stream.
In order to avoid copying attributes from their storage representation to
some main memory representation, some database management systems apply
another mechanism. They support the evaluation of some predicates directly
on the storage representation. These are boolean expressions consisting of sim-
ple predicates of the form Aθc for attributes A, comparison operators θ, and
constants c. Instead of a constant, c could also be the value of some attribute
or expression thereof given that it can be evaluated before the access to A.
Predicates evaluable on the disk representation are called SARGable where
SARG is an acronym for search argument. Note that SARGable predicates
may also be good for index lookups. Then they are called index SARGable.
In case they can not be evaluated by an index, they are called data SARGable
[772, 850, 318].
Since relation or segment scans can evaluate predicates, we have to extend
our notation for scans. Let I be a database item like a relation or segment.
Then, I[v; p] scans I, binds each item in I successively to v and returns only
those items for which p holds. I[v; p] is equivalent to σp (I[v]), but cheaper to
evaluate. If p is a conjunction of predicates, the conjuncts should be ordered
such that the attribute access cost reductions described above are reflected
(for details see Chapter ??). Syntactically, we express this by separating the
predicates by a comma as in Student[s; age > 30, name like ‘%m%0 ]. If we want
to make a distinction between SARGable and non-SARGable predicates, we
write I[v; ps ; pr ], with ps being the SARGable predicate and pr a non-SARGable
predicate. Additional extensions like a projection list are also possible.
4
The page on which the physical record resides must be fixed until all attributes are loaded.
Hence, an earlier point in time might be preferable.
4.9. TEMPORAL RELATIONS 155

4.9 Temporal Relations

Scanning a temporal relation or segment also makes sense. Whenever the result
of some (partial) query evaluation plan is used more than once, it might be
worthwhile to materialize it in some temporary relation. For this purpose, a
tmp operator evaluates its argument expression and stores the result relation in
a temporary segment. Consider the following example query.

select e.name, d.name

from Emp e, Dept d
where e.age > 30 and e.age < 40 and e.dno = d.dno

It can be evaluated by

Dept[d] Bnl
e.dno=d.dno σe.age>30∧e.age<40 (Emp[d]).

Since the inner (right) argument of the nested-loop join is evaluated several
times (once for each department), materialization may pay off. The plan then
looks like

Dept[d] Bnl
e.dno=d.dno Tmp(σe.age>30∧e.age<40 (Emp[d])).

If we choose to factorize and materialize a common subexpression, the query

evaluation plan becomes a DAG. Alternatively, we could write a small “pro-
gram” that has some statements materializing some expressions which are then
used later on. The last expression in a program determines its result. For our
example, the program looks as follows.

1. Rtmp = σe.age>30∧e.age<40 (Emp[d]);

2. Dept[d] Bnl
e.dno=d.dno Rtmp [e]

The disk costs of writing and reading temporary relations can be calculated
using the considerations of Section 4.1.

4.10 Table Functions

A table function is a function that returns a relation [566]. An example is
Primes(int from, int to), which returns all primes between from and to,
e.g. via a sieve-method. It can be used in any place where a relation name can
occur. The query

select *
from TABLE(Primes(1,100)) as p

returns all primes between 1 and 100. The attribute names of the resulting
relation are specified in the declaration of the table function. Let us assume
that for Primes a single attribute prime is specified. Note that table func-
tions may take parameters. This does not pose any problems, as long as we
156CHAPTER 4. DATABASE ITEMS, BUILDING BLOCKS, AND ACCESS PATHS

know that Primes is a table function and we translate the above query into
Primes(1, 100)[p]. Although this looks exactly like a table scan, the implemen-
tation and cost calculations are different.
Consider the following query where we extract the years in which we expect
a special celebration of Anton’s birthday.

select *
from Friends f,
TABLE(Primes(
CURRENT YEAR, EXTRACT(YEAR FROM f.birthday) + 100)) as p
where f.name = ‘Anton’

The result of the table function depends on our friend Anton. Hence, a join
is no solution. Instead, we have to introduce a new kind of join, the d-join
where the d stands for dependent. It is defined as

R < S >= {t ◦ s|t ∈ T, s ∈ S(t)}.

The above query can now be evaluted as

χb:EXT RACT Y EAR(f.birthday)+100 (σf.name=‘Anton0 (Friends[f ])) < Primes(c, b)[p] >

where we assume that some global entity c holds the value of CURRENT YEAR.
Let us do the above query for all friends. We just have to drop the where
clause. Obviously, this results in many redundant computations of primes. At
the SQL level, using the birthday of the youngest friend is beneficial:

select *
from Friends f,
TABLE(Primes(
CURRENT YEAR, (select max(birthday) from Friends) + 100)) as p
where p.prime ≥ f.birthday

At the algebraic level, this kind of optimizations will be considered in Section ??.
Things can get even more involved if table functions can consume and pro-
ToDo? duce relations, i.e. arguments and results can be relations.
Little can be said about the disk costs of table functions. They can be zero
if the function is implemented such that it does not access any disks (files stored
there), but it can also be very expensive if large files are scanned each time it is
called. One possibility is to let the database administrator specify the numbers
the query optimizer needs. However, since parameters are involved, this is
not really an easy task. Another possibility is to measure the table function’s
behavior whenever it is executed, and learn about its resource consumption.

4.11 Indexes
There exists a plethora of different index structures. In the context of relational
database management systems, the most versatile and robust index is the B-
tree or variants/improvements thereof (e.g. []). It is implemented in almost
4.11. INDEXES 157

Figure 4.9: Clustered vs. non-clustered index

every commercial database management system. Some support hash-indexes

(e.g. []). Other data models or specific applications need specialized indexes.
There exist special index structures for indexing path expressions in object-
oriented databases (e.g. []) and XML databases (e.g. []). Special purpose indexes
include join indexes (e.g. [395, 879]) multi-dimensional indexes (e.g. []), variant
(projection) indexes [638], small materialized aggregates [603], bitmap indexes
[], and temporal indexes (e.g. []). We cannot discuss all indexes and their
exploitations for efficient query evaluation. This fills more than a single book.
Instead, we concentrate on B-tree indexes. In general, a B-tree can be used
to index several relations. We only discuss cases where B-trees index a single
relation.
The search key (or key for short) of an index is the sequence of attributes
of the indexed relation over which the index is defined. A key is a simple key if
it consists of a single attribute. Otherwise, it is a complex key. Each entry in
the B-tree’s leaf page consists of pairs containing the key values and a sequence
of tuple identifiers (typically sorted by increasing page number). Every tuple
with a TID in this list satisfies the condition that its indexed attribute’s values
are equal to the key values. If for every sequence of key values there is at most
one such tuple, we have a unique index, otherwise a non-unique index .
The leaf entries may contain values from additional (non-key) attributes.
Then we call the index attribute data added and the additional attributes data
attributes. If the index contains all attributes of the indexed relation—in its
key or data attributes—storing the relation is no longer necessary. The result is
an index-only relation. In this case, the concept of tuple identifiers is normally
no longer used since tuples can now be moved frequently, e.g. due to a leaf page
split. This has two consequences. First, the data part does not longer contain
the TID. Second, other indexes on the index-only relation cannot have tuple
identifiers in their data part either. Instead, they use the key of the index-only
relation to uniquely reference a tuple. For this to work, we must have a unique
index.
B-trees can be either clustered or non-clustered indexes. In a clustered index,
the tuple identifiers in the list of leaf pages are ordered according to their page
158CHAPTER 4. DATABASE ITEMS, BUILDING BLOCKS, AND ACCESS PATHS

numbers. Otherwise, it is a non-clustered index5 . Figure 4.9 illustrates this.

Range queries result in sequential access for clustered indexes and in random
access for non-clustered indexes.

4.12 Single Index Access Path

4.12.1 Simple Key, No Data Attributes
Consider the exact match query
select name
from Emp
where eno = 1077

If there exists a unique index on the key attribute eno, we can first access the
index to retrieve the TID of the employee tuple satisfying eno = 1077. Another
page access yields the tuple itself which constitutes the result of the query. Let
Empeno be the index on eno, then we can descend the B-tree, using 1077 as the
search key. A predicate that can be used to descend the B-tree or, in general,
governing search within an index structure, is called an index sargable predicate.
For the example query, the index scan, denoted as Empeno [x; eno = 1077],
retrieves a single leaf node entry with attributes eno and TID. Similar to the
regular scan, we assume x to be a variable holding a pointer to this index
entry. We use the notations x.eno and x.TID to access these attributes. To
dereference the TID, we use the map (χ) operator and a dereference function
deref (or ∗ for short). It turns a TID into a pointer in the buffer area. This of
course requires the page to be loaded, if it is not in the buffer yet. The complete
plan for the query is
Πname (χe:∗(x.TID),name:e.name (Empeno [x; eno = 1077]))
where we computed several new attributes with one χ operator. Note that
they are dependent on previously computed attributes and, hence, the order of
evaluation does matter.
We can make the dependency of the map operator more explicit by applying
a d-join. Denote by 2 an operator that returns a single empty tuple. Then
Πname (Empeno [x; eno = 1077] < χe:∗(x.TID),name:e.name (2) >)
is equivalent to the former plan. Joins and indexes will be discussed in Sec-
tion 4.14.
A range query like
select name
from Emp
where age ≥ 25 and age ≤ 35

5
Of course, any degree of clusteredness may occur and has to be taken into account in cost
calculations.
4.12. SINGLE INDEX ACCESS PATH 159

specifies a range for the indexed attribute. It is evaluated by an index scan

with start and stop conditions. In our case, the start condition is age ≥ 25,
and the stop condition is age ≤ 35. The start condition is used to retrieve the
first tuple satisfying it by searching within the B-tree. In our case, 25 is used to
descend from the root to the leaf page containing the key 25. Then, all records
with keys larger than 25 within the page are searched. Since entries in B-tree
pages are sorted on key values, this is very efficient. If we are done with the
leaf page that contains 25 and the stop key has not been found yet, we proceed
to the next leaf page. This is possible since leaf pages of B-trees tend to be
chained. Then all records of the next leaf page are scanned and so on until we
find the stop key. The complete plan then is
Πname (χe:∗(x.TID),name:e.name (Empage [x; 25 ≤ age; age ≤ 35]))
If the index on age is non-clustered, this plan results in random I/O. We
can turn random I/O into sequential I/O by sorting the result of the index scan
on its TID attribute before dereferencing it6 . This results in the following plan:
Πname (χe:∗(TID),name:e.name (SortTID (Empage [x; 25 ≤ age; age ≤ 35; TID])))
Here, we explicitly included the TID attribute of the index into the projection
list.
Consider a similar query which demands the output to be sorted:
select name, age
from Emp
where age ≥ 25 and age ≤ 35
order by age
Since an index scan on a B-tree outputs its result ordered on the indexed at-
tribute, the following plan produces the perfect result:
Πname,age (χe:∗(x.TID),name:e.name (Empage [x; 25 ≤ age; age ≤ 35]))
On a clustered index this is most probably the best plan. On a non-clustered
index, random I/O disturbs the picture. We avoid that by sorting the result of
the index scan on the TID attribute and, after accessing the tuples, restore the
order on age as in the following plan:
Πname,age (Sortage (χe:∗(TID),name:e.name (SortTID (Empage [x; 25 ≤ age; age ≤ 35; TID]))))
An alternative to this plan is not to sort on the original indexed attribute (age
in our example), but to introduce a new attribute that holds the rank in the
sequence derived from the index scan. This leads to the plan
Πname,age (
Sortrank (
χe:∗(TID),name:e.name (
SortTID (
χrank:counter++ (
Empage [x; 25 ≤ age; age ≤ 35; TID])))))
6
This might not be necessary, if Emp fits main memory. Then, preferably asynchronous I/O
should be used.
160CHAPTER 4. DATABASE ITEMS, BUILDING BLOCKS, AND ACCESS PATHS

This alternative might turn out to be more efficient since sorting on an attribute
with a dense domain can be implemented efficiently. (We admit that in the
above example this is not worth considering.) There is another important
application of this technique: XQuery often demands output in document order.
If this order is destroyed during processing, it must at the latest be restored
when the output it produced [581]. Depending on the implementation (i.e. the
representation of document nodes or their identifiers), this might turn out to
be a very expensive operation.
The fact that index scans on B-trees return their result ordered on the
indexed attributes is also very useful if a merge-join on the same attributes (or
a prefix thereof, see Chapter 23 for further details) occurs. An example follows
later on.
Some predicates are not index SARGable, but can still be evaluated with
the index as in the following query
select name
from Emp
where age ≥ 25 and age ≤ 35 and age 6= 30

The predicate age 6= 30 is an example of a residual predicate. We can once

more extend the index scan and compile the query into

Πname (χt:x.TID,e:∗t,name:e.name (Empage [x; 25 ≤ age; age ≤ 35; age 6= 30]))

Some index scan implementations allow exclusive bounds for start and stop
conditions. With them, the query
select name
from Emp
where age > 25 and age < 35

can be evaluated using

Πname (χt:x.TID,e:∗t,name:e.name (Empage [x; 25 < age; age < 35]))

If this is not the case, two residual predicates must be used as in

Πname (χt:x.TID,e:∗t,name:e.name (Empage [x; 25 ≤ age; age ≤ 35; age 6= 25, age 6= 35]))

Especially for predicates on strings, this might be expensive.

Start and stop conditions are optional. To evaluate
select name
from Emp
where age ≥ 60
we use age ≥ 60 as the start condition to find the leaf page containing the key
60. From there on, we scan all the leaf pages “to the right”.
If we have no start condition, as in
4.12. SINGLE INDEX ACCESS PATH 161

select name
from Emp
where age ≤ 20

we descend the B-tree to the “leftmost” page, i.e. the page containing the
smallest key value, and then proceed scanning leaf pages until we encounter the
key 20.
Having neither a start nor stop condition is also quite useful. The query

select count(*)
from Emp

can be evaluated by counting the entries in the leaf pages of a B-tree. Since
a B-tree typically occupies far fewer pages than the original relation, we have
a viable alternative to a relation scan. The same applies to the aggregate
functions sum and avg. The other aggregate functions min and max can be
evaluated much more efficiently by descending to the leftmost or rightmost leaf
page of a B-tree. This can be used to answer queries like

select min/max(salary)
from Emp

much more efficiently than by a relation scan. Consider the query

select name
from Emp
where salary = (select max(salary)
from Emp)

It can be evaluated by first computing the maximum salary and then retrieving
the employees earning this salary. This requires two descendants into the B-
tree, while obviously one is sufficient. Depending on the implementation of the
index (scan), we might be able to perform this optimization.
Further, the result of an index scan, whether it uses start and/or stop con-
ditions or not, is always sorted on the key. This property can be useful for
queries with no predicates. If we have neither a start nor a stop condition, the
resulting scan is called full index scan. As an example consider the query

select salary
from Emp
order by salary

which is perfectly answered by the following full index scan:

Empsalary

So far, we have only seen indexes on numerical attributes.

162CHAPTER 4. DATABASE ITEMS, BUILDING BLOCKS, AND ACCESS PATHS

select name, salary

from Emp
where name ≥ ’Maaa’
gives rise to a start condition 0 Maaa0 ≤ name. From the query
select name, salary
from Emp
where name like ’M%’
we can deduce the start condition 0 M0 ≤ name.
To express all the different alternatives of index usage, we need a powerful
(and runtime system dependent) index scan expression. Let us first summarize
what we can specify for an index scan:
1. the name of the variable for index entries (or pointers to them),

2. the start condition,

3. the stop condition,

4. a residual predicate, and

5. a projection list.
A projection list has entries of the form a : x.b for attribute names a and b and
x being the name of the variable for the index entry. a : x.a is also allowed and
often abbreviated as a. We also often summarize start and stop conditions into
a single expression like in 25 ≤ age ≤ 35.
For a full index specification, we list all items in the subscript of the index
name separated by a semicolon. Still, we need some extensions to express the
queries with aggregation. Let a and b be attribute names, then we allow entries
of the form b : aggr(a) in the projection list and start/stop conditions of the
form min/max(a). The latter tells us to minimize/maximize the value of the
indexed attribute a. Only a complete enumeration gives us the full details. On
the other hand, extracting start and stop conditions and residual predicates
from a given boolean expression is rather simple. Hence, we often summarize
these three under a single predicate. This is especially useful when talking
about index scans in general. If we have a full index scan, we leave out the
predicate. We use a star ‘*’ as an abbreviated projection list that projects all
attributes of the index. (So far, these are the key attribute and the TID.) If
the projection list is empty, we assume that only the variable/attribute holding
the pointer to the index entry is projected.
Using this notation, we can express some plan fragments. These fragments
are complete plans for the above queries, except that the final projection is not
present. As an example, consider the following fragment:

χe:∗TID,name:e.name (Empsalary [x; TID, salary])

All the plan fragments seen so far are examples of access paths. An access
path is a plan fragment with building blocks concerning a single database item.
4.12. SINGLE INDEX ACCESS PATH 163

Hence, every building block is an access path. The above plans touch two
database items: a relation and an index on some attribute of that relation.
If we say that an index concerns the relation it indexes, such a fragment is an
access path. For relational systems, the most general case of an access path uses
several indexes to retrieve the tuples of a single relation. We will see examples
of these more complex access paths in the following section. An access to the
original relation is not always necessary. A query that can be answered solely
by accessing indexes is called an index only query.
A query with in like
select name
from Emp
where age in {28, 29, 31, 32}
can be evaluated by taking the minimum and the maximum found in the left-
hand side of in as the start and stop conditions. We further need to construct
a residual predicate. The residual predicate can be represented either as age =
28 ∨ age = 29 ∨ age = 31 ∨ age = 32 or as age 6= 30.
An alternative is to use a d-join. Consider the example query
select name
from Emp
where salary in {1111, 11111, 111111}
Here, the numbers are far apart and separate index accesses might make sense.
Therefore, let us create a temporary relation Sal equal to {[s : 1111], [s :
11111], [s : 111111]}. When using it, the access path becomes

Sal[S] < χe:∗TID,name:e.name (Empsalary [x; salary = S.s; TID]) >

Some B-tree implementations allow efficient searches for multiple ranges and
implement gap skipping [33, 34, 166, 318, 319, 468, 536]. Gap skipping, some-
times also called zig-zag skipping, continues the search for keys in a new key
range from the latest position visited. The implementation details vary but
the main idea of it is that after one range has been completely scanned, the
current (leaf) page is checked for its highest key. If it is not smaller than the
lower bound of the next range, the search continues in the current page. If it
is smaller than the lower bound of the next range, alternative implementations
are described in the literature. The simplest is to start a new search from the
root for the lower bound. Another alternative uses parent pointers to go up a
page as long as the highest key of the current page is smaller than the lower
bound of the next range. If this is no longer the case, the search continues
downwards again.
Gap skipping gives even more opportunities for index scans and allows effi-
cient implementations of various index nested loop join strategies.

4.12.2 Complex Keys and Data Attributes

In general, an index can have a complex key comprised of the key attributes
k1 , . . . , kn and the data attributes d1 , . . . , dm . One possibility is to use a full
164CHAPTER 4. DATABASE ITEMS, BUILDING BLOCKS, AND ACCESS PATHS

index scan on such an index. Having more attributes in the index makes it
more probable that queries are index-only.
Besides a full index scan, the index can be descended to directly search for
the desired tuple(s). Let us take a closer look at this possibility.
If the search predicate is of the form

k1 = c1 ∧ k2 = c2 ∧ . . . ∧ kj = cj

for some constants ci and some j <= n, we can generate the start and stop
condition
k1 = c1 ∧ . . . ∧ kj = cj .
This simple approach is only possible if the search predicates define values for
all search key attributes, starting from the first search key and then for all
keys up to the j-th search key with no key attribute unspecified in between.
Predicates concerning the other key attributes after the first non-specified key
attribute and the additional data attributes only allow for residual predicates.
This condition is often not necessary for multi-dimensional index structures,
whose discussion is beyond the book.
With ranges things become more complex and highly dependent on the
implementation of the facilities of the B-tree. Consider a query predicate re-
stricting key values as follows

k1 = c1 ∧ k2 ≥ c2 ∧ k3 = c3

Obviously, we can generate the start condition k1 = c1 ∧ k2 ≥ c2 and the stop

condition k1 = c1 . Here, we neglected the condition on k3 which becomes a
residual predicate. However, with some care we can extend the start condition
to k1 = c1 ∧ k2 ≥ c2 ∧ k3 = c3 : we only have to keep k3 = c3 as a residual
predicate, since for k2 values larger than c2 , values different from c3 can occur
for k3 .
If closed ranges are specified for a prefix of the key attributes as in

a1 ≤ k1 ≤ b1 ∧ . . . ∧ aj ≤ kj ≤ bj

we can generate the start key k1 = a1 ∧ . . . ∧ kj = aj , the stop key k1 =

b1 ∧ . . . ∧ kj = bj , and

a2 ≤ k2 ≤ b2 ∧ . . . ∧ aj ≤ kj ≤ bj

as the residual predicate. If for some search key attribute kj the lower bound aj
is not specified, the start condition cannot contain kj and any kj+i . If for some
search key attribute kj the upper bound bj is not specified, the stop condition
cannot contain kj and any kj+i .
Two further enhancements of the B-tree functionality possibly allow for
alternative start/stop conditions:

• The B-tree implementation allows to specify the order (ascending or de-

scending) for each key attribute individually.
4.13. MULTI INDEX ACCESS PATH 165

• The B-tree implementation implements forward and backward scans (e.g.

implemented in Rdb [33]).

So far, we are only able to exploit query predicates which specify value
ranges for a prefix of all key attributes. Consider querying a person on his/her
height and his/her hair color: haircolor = ’blond’ and height between
180 and 190. If we have an index on sex, haircolor, height, this index
cannot be used by means of the techniques described so far. However, since
there are only the two values male and female available for sex, we can rewrite
the query predicate to (sex = ’m’ and haircolor = ’blond’ and height
between 180 and 190) or (sex = ’f’ and haircolor = ’blond’ and height
between 180 and 190) and use two accesses to the index. This approach works
fine for attributes with a small domain and is described by Antoshenkov [34].
(See also the previous section for gap skipping.) Since the possible values for
key attributes may not be known to the query optimizer, Antoshenkov goes
one step further and shifts the construction of search ranges to index scan time.
Therefore, the index can be provided with a complex boolean expression which
is then refined (rewritten) as soon as search key values become known. Search
ranges are then generated dynamically, and gap skipping is applied to skip the
intervals between the qualifying ranges during the index scan.

4.13 Multi Index Access Path

We wish to buy a used digital camera and state the following query:

select *
from Camera
where megapixel > 5 and distortion < 0.05
and noise < 0.01
and zoomMin < 35 and zoomMax > 105

We assume that on every attribute used in the where clause there exists an
index. Since the predicates are conjunctively connected, we can use a technique
called index and-ing. Every index scan returns a set (list) of tuple identifiers.
These sets/lists are then intersected. This operation is also called And merge
[553]. Using index and-ing, a possible plan is

((((
Cameramegapixel [c; megapixel > 5; TID]
∩
Cameradistortion [c; distortion < 0.05; TID])
∩
Cameranoise [c; noise < 0.01; TID])
∩
CamerazoomMin [c; zoomMin < 35; TID])
∩
CamerazoomMax [c; zoomMax > 105; TID])
166CHAPTER 4. DATABASE ITEMS, BUILDING BLOCKS, AND ACCESS PATHS

This results in a set of tuple identifiers that only needs to be dereferenced to

access the according Camera tuples and produce the final result.
Since the costs of the expression clearly depend on the costs of the index
scans and the size of the intermediate TID sets, two questions arise:
• In which order do we intersect the TID sets resulting from the index
scans?
• Do we really apply all indexes before dereferencing the tuple identifiers?
The answer to the latter question is clearly “no”, if the next index scan is
more expensive than accessing the records in the current TID list. It can be
shown that the indexes in the cascade of intersections are ordered on increasing
(fi − 1)/ci terms, where fi is the selectivity of the index and ci its access
EX cost. Further, we can stop as soon as accessing the original tuples in the base
relation becomes cheaper than intersecting with another index and subsequently
accessing the base relation.
Index or-ing is used to process disjunctive predicates. Here, we take the
union of the TID sets to produce a set of TIDs containing references to all
qualifying tuples. Note that duplicates must be eliminated during the process-
ing of the union. This operation is also called Or merge [553]. Consider the
query
select *
from Emp
where yearsOfEmployment ≥ 30
or age ≥ 65
This query can be answered by constructing a TID set using the expression
EmpyearsOfEmployment [c; yearsOfEmployment ≥ 30; TID]∪Empage [c; age ≥ 65; TID]
and then dereferencing the list of tuple identifiers. Again, the index accessing
can be ordered for better performance. Given a general boolean expression
in and and or, constructing the optimal access path using index and-ing and
or-ing is a challenging task that will be discussed in Chapter ??. This task
is even more challenging, if some simple predicates occur more than once in
the complex boolean expression and factorization has to be taken into account.
This issue was first discussed by Chaudhuri, Ganesan and Saragawi [146]. We
will come back to this in Chapter ??.
The names index and-ing and or-ing become clear if bitmap indexes are
considered. Then the bitwise and and or operations can be used to efficiently
ToDo compute the intersection and union.

Excursion on bitmap indexes. 2

There are even more possibilities to work with TID sets. Consider the query
select *
from Emp
where yearsOfEmployment 6= 10
and age ≥ 65
4.14. INDEXES AND JOINS 167

This query can be evaluated by scanning the index on age and then eliminating
all employees with yearsOfEmployment = 10:

Empage [c; age ≥ 65; TID]\ EmpyearsOfEmployment [c; yearsOfEmployment 6= 10; TID]

Let us call the application of set difference on index scan results index differ-
encing.
Some predicates might not be very restrictive in the sense that more than
half the index has to be scanned. By negating these predicates and using
index differencing, we can make sure that at most half of the index needs to be
scanned. As an example consider the query

select *
from Emp
where yearsOfEmployment ≤ 5
and age ≤ 65

Assume that most of our employees’ age is below 65. Then

EmpyearsOfEmployment [c; yearsOfEmployment ≤ 5; TID] \ Empage [c; age > 65; TID]

could be more efficient than

EmpyearsOfEmployment [c; yearsOfEmployment ≤ 5; TID] ∩ Empage [c; age ≤ 65; TID]

4.14 Indexes and Joins

There are two issues when discussing indexes and joins. The first is that indexes
can be used to speed up join processing. The second is that index accesses can
be expressed as joins. We discuss both of these issues, starting with the latter.
In our examples, we used the map operation to (implicitly) access the re-
lation by dereferencing the tuple identifiers. We can make the implicit access
explicit by exchanging the map operator by a d-join or even a join. Then, for
example,

χe:∗TID,name:e.name (Empsalary [x; 25 ≤ age ≤ 35; TID])

becomes

Empsalary [x; 25 ≤ age ≤ 35; TID] < χe:∗TID,name:e.name (2) >

where 2 returns a single empty tuple. Assume that every tuple contains an
attribute TID containing its TID. This attribute does not have to be stored
explicitly but can be derived. Then, we have the following alternative access
path for the join (ignoring projections):

Empsalary [x; 25 ≤ age ≤ 35] Bx.TID=e.TID Emp[e]

168CHAPTER 4. DATABASE ITEMS, BUILDING BLOCKS, AND ACCESS PATHS

For the join operator, the pointer-based join implementation developed in the
context of object-oriented databases may be the most efficient way to evaluate
the access path [793]. Obviously, sorting the result of the index scan on the
tuple identifiers can speed up processing since it turns random into sequential
I/O. However, this destroys the order on the key which might itself be useful
later on during query processing or required by the query7 . Sorting the tuple
ToDo identifiers was proposed by, e.g., Yao [944], Makinouchi, Tezuka, Kitakami, and
Adachi in the context of RDB/V1 [568]. The different variants (whether or not
and where to sort, join order) can now be transparently determined by the plan
generator: no special treatment is necessary. Further, the join predicates can
not only be on the tuple identifiers but also on key attributes. This often allows
to join with other than the indexed relations (or their indexes) before accessing
the relation.
Rosenthal and Reiner proposed to use joins to represent access paths with
indexes [726]. This approach is very elegant since no special treatment for index
processing is required. However, if there are many relations and indexes, the
search space might become very large, as every index increases the number
of joins to be performed. This is why Mohan, Haderle, Wang, and Cheng
abondoned this approach and sketched a heuristics which determines an access
path in case multiple indexes on a single table exist [614].
The query

select name,age
from Person
where name like ’R%’ and age between 40 and 50

is an index only query (assuming indexes on name and age) and can be trans-
lated to

Πname,age (
Empage [a; 40 ≤ age ≤ 50; TIDa, age]
BTIDa=TIDn
Empname [n; name ≥0 R0 ; name ≤0 R0 ; TIDn, name])

Let us now discuss the former of the two issues mentioned in the section’s
introduction. The query

select *
from Emp e, Dept d
where e.name = ‘Maier’ and e.dno = d.dno

can be directly translated to

σe.name=0 Maier0 (Emp[e]) Be.dno=d.dno Dept[d]

7
Restoring the order may be cheaper than typical sorting since tuples can be numbered
before the first sort on tuple identifiers, and this dense numbering leads to efficient sort
algorithms.
4.14. INDEXES AND JOINS 169

If there are indexes on Emp.name and Dept.dno, we can replace σe.name=0 Maier0 (Emp[e])
by an index scan as we have seen previously:

χe:∗(x.T ID),A(Emp):e.∗ (Empname [x; name = ‘Maier0 ])

Here, A(Emp) : t.∗ abbreviates access to all Emp attributes. This especially
includes dno:t.dno. (Strictly speaking, we do not have to access the name
attribute, since its value is already known.)
As we have also seen, an alternative is to use a d-join instead:

Empname [x; name = ‘Maier0 ] < χt:∗(x.T ID),A(e)t.∗ (2) >

Let us abbreviate Empname [x; name = ‘Maier0 ] by Ei and χt:∗(x.T ID),A(e)t.∗ (2) by
Ea .
Now, for any e.dno, we can use the index on Dept.dno to access the ac-
cording department tuple:

Ei < Ea >< Deptdno [y; y.dno = dno] >

Note that the inner expression Deptdno [y; y.dno = dno] contains the free variable
dno, which is bound by Ea . Dereferencing the TID of the department results
in the following algebraic modelling which models a complete index nested loop
join:

Ei < Ea >< Deptdno [y; y.dno = dno; dTID : y.TID] >< χu:∗dTID,A(Dept)u.∗ (2) >

Let us abbreviate Deptdno [y; y.dno = dno; dTID : y.TID] by Di and χu:∗dTID,A(Dept)u.∗ (2)
by Da . Fully abbreviated, the expression then becomes

Ei < Ea >< Di >< Da >

Several optimizations can possibly be applied to this expression. Sorting

the outer of a d-join is useful under several circumstances since it may

• turn random I/O into sequential I/O and/or

• avoid reading the same page twice.

In our example expression,

• we can sort the result of expression Ei on TID in order to turn random

I/O into sequential I/O, if there are many employees named “Maier”.

• we can sort the result of the expression Ei < Ea > on dno for two reasons:

– If there are duplicates for dno, i.e. there are many employees named
“Maier” in each department, then this guarantees that no index page
(of the index Dept.dno) has to be read more than once.
– If additionally Dept.dno is a clustered index or Dept is an index-only
table contained in Dept.dno, then large parts of the random I/O can
be turned into sequential I/O.
170CHAPTER 4. DATABASE ITEMS, BUILDING BLOCKS, AND ACCESS PATHS

– If the result of the inner is materialized (see below), then only one
result needs to be stored. Note that sorting is not necessary, but
grouping would suffice to avoid duplicate work.

• We can sort the result of the expression Ei < Ea >< Di > on dTID for
the same reasons as mentioned above for sorting the result of Ei on TID.

EX The reader is advised to explicitly write down the alternatives. Another exercise
is to give plan alternatives for the different cases of DB2’s Hybrid Join [318]
which can now be decomposed into primitives like relation scan, index scan,
d-join, sorting, TID dereferencing, and access to a unique index (see below).
Let us take a closer look at materializating the result of the inner of the d-
join. IBM’s DB2 for MVS considers temping (i.e. creating a temporary relation)
the inner if it is an index access [318]. Graefe provides a general discussion
on the subject [345]. Let us start with the above example. Typically, many
employees will work in a single department and possibly several of them are
called “Maier”. For everyone of them, we can be sure that there exists at most
one department. Let us assume that referential integrity has been specified.
Then, there exists exactly one department for every employee. We have to find
a way to rewrite the expression

Ei < Ea >< Deptdno [y; y.dno = dno; dTID : y.TID] >

such that the mapping dno−−→dTID is explicitly materialized (or, as one could
also say, cached ). For this purpose, Hellerstein and Naughton introduced a
modified version of the map operator that materializes its result [408]. Let us
denote this operator by χmat . The advantage of using this operator is that it is
quite general and can be used for different purposes (see e.g. [100], Chap. ??,
Chap. ??). Since the map operator extends a given input tuple by some at-
tribute values, which must be computed by an expression, we need one to
express the access to a unique index. For our example, we write

IdxAccDept
dno [y; y.dno = dno]

to express the lookup of a single (unique) entry in the index on Dept.dno. We

assume that the result is a (pointer to the) tuple containing the key attributes
and all data attributes including the TID of some tuple. Then, we have to
perform a further attribute access (dereferenciation) if we are interested in only
one of the attributes.
Now, we can rewrite the above expression to

Ei < Ea >< χmat

dT ID:(IdxAccDept [y;y.dno=dno]).TID
(2) >
dno

If we further assume that the outer (Ei < Ea >) is sorted on dno, then
it suffices to remember only the TID for the latest dno. We define the map
operator χmat,1 to do exactly this. A more efficient plan could thus be

Sortdno (Ei < Ea >) < χmat,1 (2) >

dT ID:(IdxAccDept
dno [y;y.dno=dno]).TID
4.14. INDEXES AND JOINS 171

where, strictly speaking, sorting is not necessary: grouping would suffice.

Consider a general expression of the form e1 < e2 >. The free variables
used in e2 must be a subset of the variables (attributes) produced by e1 , i.e.
F(e2 ) ⊆ A(e1 ). Even if e1 does not contain duplicates, the projection of e1 on
F(e2 ) may contain duplicates. If so, materialization could pay off. However, in
general, for every binding of the variables F(e2 ), the expression e2 may produce
several tuples. This means that using χmat is not sufficient. Consider the query

select *
from Emp e, Wine w
where e.yearOfBirth = w.year

If we have no indexes, we can answer this query by a simple join where we only
have to decide the join method and which of the relations becomes the outer and
which the inner. Assume we have only wines from a few years. (Alternatively,
some selection could have been applied.) Then it might make sense to consider
the following alternative:

Wine[w] < σe.yearOfBirth=w.year (Emp[e]) >

However, the relation Emp is scanned once for each Wine tuple. Hence, it might
make sense to materialize the result of the inner for every year value of Wine
if we have only a few year values. In other words, if we have many duplicates
for the year attribute of Wine, materialization may pay off since then we have
to scan Emp only once for each year value of Wine. To achieve caching of the
inner, in case every binding of its free variables possibly results in many tuples,
requires a new operator. Let us call this operator memox and denote it by M
[345, 100]. For the free variables of its only argument, it remembers the set
of result tuples produced by its argument expression and does not evaluate it
again if it is already cached. Using memox, the above plan becomes

Wine[w] < M(σe.yearOfBirth=w.year (Emp[e])) >

It should be clear that for more complex inners, the memox operator can be
applied at all branches, giving rise to numerous caching strategies. Analogously
to the materializing map operator, we are able to restrict the materialization
to the results for a single binding for the free variables if the outer is sorted (or
grouped) on the free variables:

Sortw.yearOfBirth (Wine[w]) < M1 (σe.yearOfBirth=w.year (Emp[e])) >

Things can become even more efficient if there is an index on Emp.yearOfBirth:

Sortw.yearOfBirth (Wine[w])
< M1 (EmpyearOfBirth [x; x.yearOfBirth = w.year] < χe:∗(x.TID),A(Emp):∗e (2) >) >

So far we have seen different operators which materialize values: Tmp, M,

and χmat . The latter in two variants. As an exercise, the reader is advised to
discuss the differences between them. EX
172CHAPTER 4. DATABASE ITEMS, BUILDING BLOCKS, AND ACCESS PATHS

Assume, we have indexes on both Emp.yearOfBirth and Wine.year. Be-

sides the possibilities to use either Emp or Wine as the outer, we now also have
the possibility to perform a join on the indexes before accessing the actual Emp
and Wine tuples. Since the index scan produces its output ordered on the key
attributes, a simple merge join suffices (and we are back at the latter):

merge
EmpyearOfBirth [x] Bx.yearOfBirth=y.year Wineyear [y]

This example makes clear that the order provided by an index scan can be used
to speed up join processing. After evaluating this plan fragment, we have to
access the actual Emp and Wine tuples. We can consider zero, one, or two sorts
EX on their respective tuple identifiers. If the join is sufficiently selective, one of
these alternatives may prove more sufficient than the ones we have considered
so far.

4.15 Remarks on Access Path Generation

A last kind of optimization we briefly want to mention is sideways information
passing. Consider a simple join between two relations: R BR.a=S.b S. If we
decide to perform a sort merge join or a hash join, we can implement it by first
sorting/partitioning R before looking at S. While doing so, we can remember
the minimum and maximum value of R.a and use these as a restriction on S
such that fewer tuples of S have to be sorted/partitioned. In case we perform a
blockwise nested loop join, after the first scan of S we know the minimum and
maximum value of S.b and can use these to restrict R.
If the number of distinct values of R.a is small, we could also decide to
remember all these values and evaluate perform a semi-join before the actual
join. Algebraically, this could be expressed as

R BR.a=S.b (S NS.b=R.a ΠR.a (R))

An alternative is to use a bitmap to represent the projection of R on a.

The semi-join technique should be well-known from distributed database
systems. In deductive database systems, this kind of optimization often carries
the attribute magic. We will more deeply discuss this issue in Chapter ??.
The following problem is not discussed in the book. Assume that we have
fully partitioned a relation vertically into a set of files which are chronologically
ordered. Then, the attribute ai of the j-th tuple can be found at the j-th
position of the i-th file. This organizion is called partitioned transposed file [55].
(Compare this with variant (projection) indexes [638] and small materialized
aggregates [603].) The problem is to find an access strategy to all the attribute
required by the query given a collection of restriction on some of the relation’s
attributes. This problem has been discussed in depth by Batory [55]. Full
vertical partitioning is also used as the organizing principle of Monet []. Lately,
it also gained some interest in the US [].
4.16. COUNTING THE NUMBER OF ACCESSES 173

4.16 Counting the Number of Accesses

4.16.1 Counting the Number of Direct Accesses
After the index scan, we have a set of (distinct) tuple identifiers for which we
have to access the original tuples. The question we would like to answer is:

How many pages do we have to read?

Let R be the relation for which we have to retrieve the tuples. Then we use the
following abbreviations

N |R| number of tuples in the relation R

m ||R|| number of pages on which tuples of R are stored
B N/m number of tuples per page (blocking factor )
k number of (distinct) TIDs for which tuples have to be retrieved

We assume that the tuples are uniformly distributed among the m pages. Then,
each page stores B = N/m tuples. B is called blocking factor .
Let us consider some borderline cases. If k > N − N/m or m = 1, then all
pages are accessed. If k = 1 then exactly one page is accessed. The answer to
the general question will be expressed in terms of buckets (pages in the above
case) and items contained therein (tuples in the above case). Later on, we will
also use extents, cylinders, or tracks as buckets and tracks or sectors/blocks as
items.
We assume that a bucket contains items. The total number of items will be
N and the number of requested items will be k. The above question can then
be reformulated to how many buckets contain at least one of the k requested
items, i.e. how many qualifying buckets exist. We start out by investigating
the case where the items are uniformly distributed among the buckets. Two
subcases will be distinguished:

1. k distinct items are requested

2. k non-distinct items are requested.

We then discuss the case where the items are non-uniformly distributed.
In any case, the underlying access model is random access. For example,
given a tuple identifier, we can directly access the page storing the tuple. Other
access models are possible. The one we will subsequently investigate is sequen-
tial access where the buckets have to be scanned sequentially in order to find
the requested items. After that, we are prepared to develop a model for disk
access costs.
Throughout this section, we will further assume that the probability that
1 N
we request a set with k items is N for all of the k possibilities to select
(k)
8
a k-set. We often make use of established equalities for binomial coefficients.
For convenience, the most frequently used equalities are listed in Appendix D.
8
A k-set is a set with cardinality k.
174CHAPTER 4. DATABASE ITEMS, BUILDING BLOCKS, AND ACCESS PATHS

Selecting k distinct items

Our first theorem was discovered independently by Waters [900] and Yao [941].
We formulate it in terms of buckets containing items. We say a bucket qualifies
if it contains at least one of the k items we are looking for.
Theorem 4.16.1 (Waters/Yao) Consider m buckets with n items each. Then
there is a total of N = nm items. If we randomly select k distinct items from
all items, then the number of qualifying buckets is
N,m
Yn (k) = m ∗ YnN (k) (4.2)
where YnN (k) is the probability that a bucket contains at least one item. This
probability is equal to

N [1 − p] k ≤N −n
Yn (k) =
1 k >N −n
where p is the probability that a bucket contains none of the k items. The
following alternative expressions can be used to calculate p:
N −n

k
p = N
(4.3)
k
k−1
Y N −n−i
= (4.4)
N −i
i=0
n−1
Y N −k−i
= (4.5)
N −i
i=0

The second expression (4.4) is due to Yao, the third (4.5) is due to Waters.
Palvia and March proved both formulas to be equal [656] (see also [38]). The
fraction m = N/n may not be an integer. For these cases, it is advisable to
have a Gamma-function based implementation of binomial coeffcients at hand
(see [689] for details).
Depending on k and n, either the expression of Yao or the one of Waters is
faster to compute. After the proof of the above formulas and the discussion of
some special cases, we will give several approximations for p.

Proof
The total number of possibilities to pick the k items from all N items is
N
k . The number of possibilities to pick k items from all items not contained in

a fixed single bucket is N k−n . Hence, the probability p that a bucket does not

qualify is p = N k−n / Nk . Using this result, we can do the following calculation
N −n

k
p = N

k
(N − n)! k!(N − k)!
=
k!((N − n) − k)! N !
k−1
Y N −n−i
=
N −i
i=0
4.16. COUNTING THE NUMBER OF ACCESSES 175

which proves the second expression. The third follows from

N −n

k
p = N

k
(N − n)! k!(N − k)!
=
k!((N − n) − k)! N !
(N − n)! (N − k)!
=
N ! ((N − k) − n)!
n−1
Y N −k−i
=
N −i
i=0

2
Let us list some special cases:

If then YmN (k) =

n=1 k/N
n=N 1
k=0 0
k=1 B/N = (N/m)N = 1/m
k=N 1

We examine a slight generalization of the first case in more detail. Let N items
be distributed over N buckets such that every bucket contains exactly one item.
Further let us be interested in a subset of m buckets (1 ≤ m ≤ N ). If we pick
k items, then the number of buckets within the subset of size m that qualify is

k
mY1N (k) = m (4.6)
N
In order to see that the two sides are equal, we perform the following calculation:
N −1

Y1N (k) = (1 − k
N
)
k
(N −1)!
k!((N −1)−k)!
= (1 − N!
)
k!(N −k)!
(N − 1)!k!(N − k)!
= (1 − )
N !k!((N − 1) − k)!
N −k
= (1 − )
N
N N −k
= ( − )
N N
N −N +k
=
N
k
=
N
176CHAPTER 4. DATABASE ITEMS, BUILDING BLOCKS, AND ACCESS PATHS

Since the computation of YnN (k) can be quite expensive, several approxima-
tions have been developed. The first one was given by Waters [899, 900]:

p ≈ (1 − k/N )n

This approximation (also described elsewhere [313, 656]) turns out to be pretty
good. However, below we will see even better approximations.
N,m
For Y n (k) Whang, Wiederhold, and Sagalowicz gave the following ap-
proximation for faster calculation [912]:

m ∗ [ (1 − (1 − 1/m)k )+
(1/(m2 n) ∗ k(k − 1)/2 ∗ (1 − 1/m)k−1 )+
(1.5/(m3 n4 ) ∗ k(k − 1)(2k − 1)/6 ∗ (1 − 1/m)k−1 ) ]

A rough estimate is presented by Bernstein, Goodman, Wong, Reeve, and Roth-

nie [77]: 
 k if k < m2
N,m k+m m
Y n (k) ≈ if ≤ k < 2m
 2 2
m if 2m ≤ k
An interesting and useful result was derived by Dihr and Saharia [237]. They
give two formulas and show that they are lower and upper bounds to Water
and Yao’s formula. The upper and lower bounds for p are

k
plower = (1 − )n
N − n−1
2
k k
pupper = ((1 − ) ∗ (1 − ))n/2
N N −n+1
for n = N/m. Dihr and Saharia claim that the maximal difference resulting
from the use of the lower and the upper bound to compute the number of page
accesses is 0.224—far less than a single page access.

Selecting k non-distinct items

So far, we assumed that we retrieve k distinct items. We could ask the same
question for k non-distinct items. This question demands a different urn mod-
el. In urn model terminology, the former case is an urn model with a non-
replacement assumption, while the latter case is one with a replacement as-
sumption. (Deeper insight into urn models is given by Drmota, Gardy, and
Gittenberger [244].)
Before presenting a theorem discovered by Cheung [170], we repeat a the-
orem from basic combinatorics. We know that the number of subsets of size
k of a set with N elements is Nk . The following lemma gives us the number
of k-multisets9 (see, e.g. [818]). The
number
of k-multisets taken from a set S
N
with |S| elements is denoted by .
k
9
A k-multiset is a multiset with k elements.
4.16. COUNTING THE NUMBER OF ACCESSES 177

Lemma 4.16.2 Let S be a set with |S| = N elements. Then, the number of
multisets with cardinality k containing only elements from S is

N N +k−1
=
k k

For a proof we just note that there is a bijection between the k-multisets and
the k-subsets of a N + k − 1-set. We can go from a multiset to a set by f with
f ({x1 ≤ . . . ≤ xk }) = {x1 + 0 < x2 + 1 < . . . < xk + (k − 1)} and from a set to a
multiset via g with g({x1 < . . . < xk }) = {x1 − 0 < x2 − 1 < . . . < xk − (k − 1)}.

Theorem 4.16.3 (Cheung) Consider m buckets with n items each. Then

there is a total of N = nm items. If we randomly select k not necessarily
distinct items from all items, then the number of qualifying buckets is

N,m
Cheungn (k) = m ∗ CheungN
n (k) (4.7)

where
CheungN
n (k) = [1 − p̃] (4.8)

with the following equivalent expressions for p̃:

N −n+k−1

k
p̃ = N +k−1
(4.9)
k
k−1
Y N −n+i
= (4.10)
N +i
i=0
n−1
Y N −1−i
= (4.11)
N −1+k−i
i=0

Eq. 4.9 follows from the observation that the probability that some bucket
(N −n+k−1
k )
does not contain any of the k possibly duplicate items is N +k−1 . Eq. 4.10
( k )
follows from

N −n+k−1

k
p̃ = N +k−1

k
(N − n + k − 1)! k!((N + k − 1) − k)!
=
k!((N − n + k − 1) − k)! (N + k − 1)!
(N − n − 1 + k)! (N − 1)!
=
(N − n − 1)! (N − 1 + k)!
k−1
Y N −n+i
=
N +i
i=0
178CHAPTER 4. DATABASE ITEMS, BUILDING BLOCKS, AND ACCESS PATHS

Eq. 4.11 follows from

N −n+k−1

k
p̃ = N +k−1

k
(N − n + k − 1)! k!((N + k − 1) − k)!
=
k!((N − n + k − 1) − k)! (N + k − 1)!
(N + k − 1 − n)! (N − 1)!
=
(N + k − 1)! (N − 1 − n)!
n−1
Y N −n+i
=
N +k−n+i
i=0
n−1
Y N −1−i
=
N −1+k−i
i=0

2
Cardenas discovered a formula that can be used to approximate p̃ [123]:

(1 − n/N )k

As Cheung pointed out, we can use the theorem to derive the number of
distinct items accessed contained in a k-multiset.

Corollary 4.16.4 Let S be a k-multiset containing elements from an N -set T .

Then the number of distinct items contained in S is

Nk
D(N, k) = (4.12)
N +k−1
if the elements in T occur with the same probability in S.

We apply the theorem for the specialQ0 case where every bucket contains exactly
N −1−i −1
one item (n = 1). In this case, i=0 N −1+k−i = NN−1+k . And the number of
N −1 N −1+k−N +1
qualifying buckets is N (1 − N −1+k ) = N ( N −1+k ) = N N +k−1 k
. 2
N

Another way to achieve this formula is the following. There are l pos-
sibilities to pick l different elements out of the N elements in T . In order to
build a k-multiset with l different elements, we mustadditionally choose n − l
N
l
elements from the l elements. Thus, we have l possibilities to
n−l

N
build a k-multiset. The total number of multisets is . Thus we may
l
conclude that
N
l
min(N,k)
X l n−l
D(N, k) = l
l=1
N
l
which can be simplified to the above.
4.16. COUNTING THE NUMBER OF ACCESSES 179

A useful application of this formula is to calculate the size of a projection

[170]. Another use is that calculating the number of distinct values contained
in a multiset allows us to shift from the model with replacement to a model
without replacement. However, there is a difference between
N,m N,m
Yn (Distinct(N, k)) ≈ Cheungn (k)

even when computing Y with Eq. 4.5. Nonetheless, for n ≥ 5, the error is
less than two percent. One of the problems when calculating the result of the
left-hand side is that the number of distinct items is not necessarily an integer.
To solve this problem, we can implement all our formulas using the Gamma-
function. But even then a small difference remains.
The approximation given in Theorem 4.16.3 is not too accurate. A better
approximation can be calculated from the probability distribution. Denote by
p(D(N, k) = j) the probability that the number of distinct values if we randomly
select k items with replacement from N given items equals j. Then
X
N k j
p(D(N, k) = j) = j(−1) ((j − l)/N )k
j l
l=0

and thus
min(N,k)
X X
N j
D(N, k) = j j(−1)k ((j − l)/N )k
j l
j=1 l=0

This formula is quite intense to calculate. We can derive a very good approx-
imation by the following reasoning. We draw k elements from the set T with
|T | = N elements. Every element from T can be drawn at most k times. We
produce N buckets, one for each element of T . In each bucket, we insert k
copies of the according element from t. Then, a sequence of draws from T
with duplicates can be represented by a sequence of draws without duplicate
by mapping them to different copies. Thus, the first occurrence is mapped to
the first element in the according bucket, the second one to the second copy
and so on. Then, we can apply formula by Waters and Yao to calculate the
number of buckets (and hence elements of T ) hit:
N k,k
D(N, k) = Y N (k)

Since the approximation is quite accurate and we already know how to efficiently
calculate this formula, this is our method of choice.

Non-Uniform Distribution of Items

In the previous sections, we assumed that

1. every page contains the same number of records, and

2. every record is accessed with the same probability.

180CHAPTER 4. DATABASE ITEMS, BUILDING BLOCKS, AND ACCESS PATHS

We now turn to relax the first assumption. Christodoulakis models the distri-
bution by m numbers ni (for 1 ≤ i ≤ m) if there are m buckets. Each ni equals
the number of records in some bucket i [173]. Luk proposes Zipfian record
distribution [561]. However, Ijbema and Blanken say that Water and Yao’s
formula is still better, as Luk’s formula results in too low values [434]. They all
come up with the same general formula presented below. Vander Zander, Tay-
lor, and Bitton [955] discuss the problem of correlated attributes which results
in some clusteredness. Zahorjan, Bell, and Sevcik discuss the problem where
every item is assigned its own access probability [954]. That is, they relax the
second assumption. We will come back to these issues in Section ??.
We still assume that every item is accessed with the same probability.
However, we relax the first assumption. The following formula derived by
Christodoulakis [173], Luk [561], and Ijbema and Blanken [434] is a simple
application of Waters’s and Yao’s formula to a more general case.

Theorem 4.16.5 (Yao/Waters/Christodoulakis) Assume a set of m buck-

ets. Each
Pm bucket contains nj > 0 items (1 ≤ j ≤ m). The total number of items
is N = j=1 nj . If we look up k distinct items, then the probability that bucket
j qualifies is N −nj
WnNj (k, j) = [1 − k
N
] (= YnNj (k))