CSE 544
Principles of Database
Management Systems
Fall 2016
Lecture 8 - Query optimization
Announcements
• HW2 (SimpleDB) is due next Friday!
• Midterm in two weeks, Nov. 10, in class
• Project Milestone due on Nov. 16
CSE 544 - Fall 2016 2
Recommended Readings
Access path selection in a relational database management
system.
Selinger. et. al. SIGMOD 1979
Additional resources:
• Chaudhuri, "An Overview of Query Optimization in
Relational Systems," Proceedings of ACM PODS, 1998
• Database management systems.
Ramakrishnan and Gehrke.
Third Ed. Chapter 15.
CSE 544 - Fall 2016 3
Query Optimization Motivation
SQL query
Declarative query
Recall physical and
Parse & Rewrite Query logical data independence
Logical
Query Select Logical Plan
plan
optimization
Select Physical Plan
Physical
plan
Query Execution
Disk 4
What We Already Know…
Supplier(sno,sname,scity,sstate)
Part(pno,pname,psize,pcolor)
Supply(sno,pno,price)
For each SQL query….
SELECT S.sname
FROM Supplier S, Supply U
WHERE S.scity='Seattle' AND S.sstate='WA’
AND S.sno = U.sno
AND U.pno = 2
There exist many logical query plan…
CSE 544 - Fall 2016 5
Example Query: Logical Plan 1
π sname
σ sscity=‘Seattle’ ∧sstate=‘WA’ ∧ pno=2
sno = sno
Supplier Supply
CSE 544 - Fall 2016 6
Example Query: Logical Plan 2
π sname
sno = sno
σ sscity=‘Seattle’ ∧sstate=‘WA’ σ pno=2
Supplier Supply
CSE 544 - Fall 2016 7
What We Also Know
• For each logical plan…
• There exist many physical plans
CSE 544 - Fall 2016 8
Example Query: Physical Plan 1
(On the fly) π sname
(On the fly)
σ scity=‘Seattle’ ∧sstate=‘WA’ ∧ pno=2
(Nested loop)
sno = sno
Supplier Supply
(File scan) (File scan)
CSE 544 - Fall 2016 9
Example Query: Physical Plan 2
(On the fly) π sname
(On the fly)
σ scity=‘Seattle’ ∧sstate=‘WA’ ∧ pno=2
(Index nested loop)
sno = sno
Supplier Supply
(File scan) (Index scan)
CSE 544 - Fall 2016 10
Query Optimization
Three major components:
1. Cardinality and cost estimation
2. Search space
3. Plan enumeration algorithms
CSE 544 - Fall 2016 11
Estimating Cost of a Query Plan
Goal: compute the cost of an entire physical query plan
• We already know how to
– Compute the cost of different operations in terms of number Ios,
given the T(R)’s and the B(R)’s
• We still need to do
– Access path selection: compute cost of retrieving tuples from disk
with different access paths
– Size estimation: compute the T(R)’s and the B(R)’s for
intermediate relations R
CSE 544 - Fall 2016 12
Access Path
Access path: a way to retrieve tuples from a table
• A file scan
• An index plus a matching selection condition
CSE 544 - Fall 2016 13
Access Path Selection
• Supplier(sid,sname,scity,sstate)
• Selection condition: sid > 300 ∧ scity=‘Seattle’
• Indexes: B+-tree on sid and B+-tree on scity
• Which access path should we use?
• We should pick the most selective access path
CSE 544 - Fall 2016 14
Access Path Selectivity
• Access path selectivity is the number of pages
retrieved if we use this access path
– Most selective retrieves fewest pages
• As we saw earlier, for equality predicates
– Selection on equality: σa=v(R)
– V(R, a) = # of distinct values of attribute a
– 1/V(R,a) is thus the reduction factor
– Clustered index on a: cost B(R)/V(R,a)
– Unclustered index on a: cost T(R)/V(R,a)
– (we are ignoring I/O cost of index pages for simplicity)
CSE 544 - Fall 2016 15
Selectivity for Range Predicates
Selection on range: σa>v(R)
• How to compute the selectivity?
• Assume values are uniformly distributed
• Reduction factor X
• X = (Max(R,a) - v) / (Max(R,a) - Min(R,a))
• Clustered index on a: cost B(R)*X
• Unclustered index on a: cost T(R)*X
CSE 544 - Fall 2016 16
Back to Our Example
• Selection condition: sid > 300 ∧ scity=‘Seattle’
– Index I1: B+-tree on sid clustered
– Index I2: B+-tree on scity unclustered
• Let’s assume
– V(Supplier,scity) = 20
– Max(Supplier, sid) = 1000, Min(Supplier,sid)=1
– B(Supplier) = 100, T(Supplier) = 1000
• Cost I1: B(R) * (Max-v)/(Max-Min) = 100*700/999 ≈ 70
• Cost I2: T(R) * 1/V(Supplier,scity) = 1000/20 = 50
CSE 544 - Fall 2016 17
Selectivity with
Multiple Conditions
What if we have an index on multiple attributes?
• Example selection σa=v1 ∧ b= v2(R) and index on <a,b>
How to compute the selectivity?
• Assume attributes are independent
• X = 1 / (V(R,a) * V(R,b))
• Clustered index on <a,b>: cost B(R)*X
• Unclustered index on <a,b>: cost T(R)*X
CSE 544 - Fall 2016 18
Estimating Cost of a Query Plan
Goal: compute the cost of an entire physical query plan
• We already know how to
– Compute the cost of different operations in terms of number Ios,
given the T(R)’s and the B(R)’s
• We still need to do
– Access path selection: compute cost of retrieving tuples from disk
with different access paths
– Size estimation: compute the T(R)’s and the B(R)’s for
intermediate relations R
CSE 544 - Fall 2016 19
Statistics on Base Data
• Collected information for each relation
– Number of tuples (cardinality) T(R)
– Number of physical pages B(R), clustering info
– Indexes, number of keys in the index V(R,a)
– Statistical information on attributes
• Min value, max value, number distinct values
• Histograms
– Correlations between columns (hard)
• Collection approach: periodic, using sampling
CSE 544 - Fall 2016 20
Size Estimation
Projection: output size same as input size
Selection: multiply input size by reduction factor
• Similar to what we did for estimating access path
selectivity
• Assume independence between conditions in the
predicate
• Examples:
T(σA=...(R)) = T(R) / V(R,A)
T(σA=...∧ B=... (R)) = T(R) / (V(R,A) * V(R,B))
CSE 544 - Fall 2016 21
Estimating Result Sizes
Join R ⋈ S
• Take product of cardinalities of relations R and S
• Apply reduction factors for each term in join condition
• Terms are of the form: column1 = column2
• Reduction: 1/ ( MAX( V(R,column1), V(S,column2))
• Why? Will explain next...
CSE 544 - Fall 2016 22
Assumptions
• Containment of values: if V(R,A) <= V(S,B), then
the set of A values of R is included in the set of B
values of S
– Note: this indeed holds when A is a foreign key in R,
and B is a key in S
• Preservation of values: for any other attribute C,
V(R ⨝A=B S, C) = V(R, C) (or V(S, C))
CSE 544 - Fall 2016 23
Selectivity of R ⨝A=B S
Assume V(R,A) <= V(S,B)
• Each tuple t in R joins with T(S)/V(S,B) tuple(s) in S
• Hence T(R ⨝A=B S) = T(R) T(S) / V(S,B)
In general: T(R ⨝A=B S) = T(R) T(S) / max(V(R,A),V(S,B))
CSE 544 - Fall 2016 24
Complete Example
Supplier(sid, sname, scity, sstate) SELECT sname
Supply(sid, pno, quantity) FROM Supplier x, Supply y
WHERE x.sid = y.sid
and y.pno = 2
• Some statistics
and x.scity = ‘Seattle’
– T(Supplier) = 1000 records and x.sstate = ‘WA’
– T(Supply) = 10,000 records
– B(Supplier) = 100 pages
– B(Supply) = 100 pages
– V(Supplier,scity) = 20, V(Suppliers,state) = 10
– V(Supply,pno) = 2,500
– Both relations are clustered
• M = 11
CSE 544 - Fall 2016 25
Computing the Cost of a Plan
• Estimate cardinality in a bottom-up fashion
– Cardinality is the size of a relation (nb of tuples)
– Compute size of all intermediate relations in plan
• Estimate cost by using the estimated cardinalities
CSE 544 - Fall 2016 26
T(Supplier) = 1000 B(Supplier) = 100 V(Supplier,scity) = 20 M = 11
T(Supply) = 10,000 B(Supply) = 100 V(Supplier,state) = 10
V(Supply,pno) = 2,500
Physical Query Plan 1
(On the fly) π sname Selection and project on-the-fly
-> No additional cost.
(On the fly)
σ scity=‘Seattle’ ∧sstate=‘WA’ ∧ pno=2
Total cost of plan is thus cost of join:
= B(Supplier)+B(Supplier)*B(Supplies)
(Nested loop) = 100 + 100 * 100
sno = sno
= 10,100 I/Os
Supplier Supply
(File scan) (File scan)
CSE 544 - Fall 2016 27
T(Supplier) = 1000 B(Supplier) = 100 V(Supplier,scity) = 20 M = 11
T(Supply) = 10,000 B(Supply) = 100 V(Supplier,state) = 10
V(Supply,pno) = 2,500
Physical Query Plan 2
Total cost
(On the fly) π sname (4) = 100 + 100 * 1/20 * 1/10 (1)
+ 100 + 100 * 1/2500 (2)
+ 2 (3)
(Sort-merge join) (3) + 0 (4)
sno = sno Total cost ≈ 204 I/Os
(Scan
write to T1) (Scan
write to T2)
(1) σ scity=‘Seattle’ ∧sstate=‘WA’ (2) σ pno=2
Supplier Supply
(File scan) (File scan)
CSE 544 - Fall 2016 28
Plan 2 with Different Numbers
Total cost
What if we had: π sname (4) = 10000 + 50 (1)
10K pages of Suppliers
+ 10000 + 4 (2)
10K pages of Supplies
+ 4*50 + 2*4 + 4 + 50 (3)
(Sort-merge join) (3) + 0 (4)
sno = sno Total cost ≈ 20,316 I/Os
(Scan
write to T1) (Scan
write to T2)
(1) σ scity=‘Seattle’ ∧sstate=‘WA’ (2) σ pno=2
Assuming naive
Supplier Supply two-pass sort
(File scan) (File scan) algorithm
CSE 544 - Fall 2016 29
T(Supplier) = 1000 B(Supplier) = 100 V(Supplier,scity) = 20 M = 11
T(Supply) = 10,000 B(Supply) = 100 V(Supplier,state) = 10
V(Supply,pno) = 2,500
Physical Query Plan 3
(On the fly) (4) π sname
Total cost
= 1 (1)
(On the fly)
+ 4 (2)
(3) σ scity=‘Seattle’ ∧sstate=‘WA’ + 0 (3)
+ 0 (3)
Total cost ≈ 5 I/Os
(2) (Index nested loop)
sno = sno
(Use hash index) 4 tuples
(1) σ pno=2
Supply Supplier
(Hash index on pno ) (Hash index on sno)
30
Assume: clustered Clustering does not matter
Simplifications
• In the previous examples, we assumed that all index
pages were in memory
• When this is not the case, we need to add the cost of
fetching index pages from disk
CSE 544 - Fall 2016 31
Different Cost Models
• In previous examples, we considered IO costs
• Typically, want IO+CPU
• For parallel/distributed queries, add network bandwidth
• If need to compare logical plans
– Compute the cardinality of each intermediate relation
– Sum up all the cardinalities
CSE 544 - Fall 2016 32
Summary
Goal: compute the cost of an entire physical query plan
• We already know how to
– Compute the cost of different operations in terms of number Ios,
given the T(R)’s and the B(R)’s
• We still need to do
– Access path selection: compute cost of retrieving tuples from disk
with different access paths
– Size estimation: compute the T(R)’s and the B(R)’s for
intermediate relations R
CSE 544 - Fall 2016 33
Query Optimization
Three major components:
1. Cardinality and cost estimation
2. Search space
3. Plan enumeration algorithms
CSE 544 - Fall 2016 34
Relational Algebra Laws
• Selections
– Commutative: σc1(σc2(R)) same as σc2(σc1(R))
– Cascading: σc1∧c2(R) same as σc2(σc1(R))
• Projections
– Cascading
• Joins
– Commutative : R ⋈ S same as S ⋈ R
– Associative: R ⋈ (S ⋈ T) same as (R ⋈ S) ⋈ T
CSE 544 - Fall 2016 35
Left-Deep Plans and
Bushy Plans
R2
R4
R3 R1 R3 R1 R2 R4
Left-deep plan Bushy plan
CSE 544 - Fall 2016 36
Relational Algebra Laws
• Selects, projects, and joins
– We can commute and combine all three types of operators
– We just have to be careful that the fields we need are available
when we apply the operator
– Relatively straightforward. See book 15.3.
• More info in optional paper (by Chaudhuri), Section 4.
CSE 544 - Fall 2016 37
Group-by and Join
R(A, B), S(C,D)
γA, sum(D)(R(A,B) ⨝ B=C S(C,D)) = ?
CSE 544 - Fall 2016 38
Group-by and Join
R(A, B), S(C,D)
γA, sum(D)(R(A,B) ⨝ B=C S(C,D)) =
γA, sum(D)(R(A,B) ⨝ B=C (γC, sum(D)S(C,D)))
These are very powerful laws.
They were introduced only in the 90’s.
CSE 544 - Fall 2016 39
Search Space Challenges
• Search space is huge!
– Many possible equivalent trees (logical)
– Many implementations for each operator (physical)
– Many access paths for each relation (physical)
• Cannot consider ALL plans
• Want a search space that includes low-cost plans
• Typical compromises:
– Only left-deep plans
– Only plans without cartesian products
– Always push selections down to the leaves
40
Query Optimization
Three major components:
1. Cardinality and cost estimation
2. Search space
3. Plan enumeration algorithms
CSE 544 - Fall 2016 41
Two Types of Optimizers
• Heuristic-based optimizers:
– Apply greedily rules that always improve plan
• Typically: push selections down
– Very limited: no longer used today
• Cost-based optimizers:
– Use a cost model to estimate the cost of each plan
– Select the “cheapest” plan
– We focus on cost-based optimizers
CSE 544 - Fall 2016 42
Three Approaches to Search
Space Enumeration
• Complete plans
• Bottom-up plans
• Top-down plans
CSE 544 - Fall 2016 43
Complete Plans
R(A,B) SELECT *
S(B,C) FROM R, S, T
T(C,D) WHERE R.B=S.B and S.C=T.C and R.A<40
⨝
⨝
⨝ Why is this
T search space
σA<40 ⨝ inefficient ?
σA<40 S
R S T
R
CSE 544 - Fall 2016 44
Bottom-up Partial Plans
R(A,B) SELECT *
S(B,C) FROM R, S, T
T(C,D) WHERE R.B=S.B and S.C=T.C and R.A<40
Why is this ⨝
better ?
⨝ ⨝ T
σA<40 ⨝ σA<40 S ⨝ σA<40 S
…..
R S T R R S R
CSE 544 - Fall 2016 45
Top-down Partial Plans
R(A,B) SELECT *
S(B,C) FROM R, S, T
T(C,D) WHERE R.B=S.B and S.C=T.C and R.A<40
⨝ ⨝ σA<40
⨝
T T SELECT R.A, T.D
SELECT * FROM R, S, T
FROM R, S WHERE R.B=S.B
WHERE R.B=S.B
and R.A < 40 SELECT * S
and S.C=T.C …..
FROM R
WHERE R.A < 40
CSE 544 - Fall 2016 46
Two Types of Plan
Enumeration Algorithms
• Dynamic programming (in class)
– Based on System R (aka Selinger) style optimizer[1979]
– Limited to joins: join reordering algorithm
– Bottom-up
• Rule-based algorithm (will not discuss)
– Database of rules (=algebraic laws)
– Usually: dynamic programming
– Usually: top-down
CSE 544 - Fall 2016 47
System R Search Space
• Only left-deep plans
– Enable dynamic programming for enumeration
– Facilitate tuple pipelining from outer relation
• Consider plans with all “interesting orders”
• Perform cross-products after all other joins (heuristic)
• Only consider nested loop & sort-merge joins
• Consider both file scan and indexes
• Try to evaluate predicates early
CSE 544 - Fall 2016 48
Plan Enumeration Algorithm
• Idea: use dynamic programming
• For each subset of {R1, …, Rn}, compute the best plan
for that subset
• In increasing order of set cardinality:
– Step 1: for {R1}, {R2}, …, {Rn}
– Step 2: for {R1,R2}, {R1,R3}, …, {Rn-1, Rn}
– …
– Step n: for {R1, …, Rn}
• It is a bottom-up strategy
• A subset of {R1, …, Rn} is also called a subquery
CSE 544 - Fall 2016 49
Dynamic Programming Algo.
• For each subquery Q ⊆{R1, …, Rn} compute the
following:
– Size(Q)
– A best plan for Q: Plan(Q)
– The cost of that plan: Cost(Q)
CSE 544 - Fall 2016 50
Dynamic Programming Algo.
• Step 1: Enumerate all single-relation plans
– Consider selections on attributes of relation
– Consider all possible access paths
– Consider attributes that are not needed
– Compute cost for each plan
– Keep cheapest plan per “interesting” output order
CSE 544 - Fall 2016 51
Dynamic Programming Algo.
• Step 2: Generate all two-relation plans
– For each each single-relation plan from step 1
– Consider that plan as outer relation
– Consider every other relation as inner relation
– Compute cost for each plan
– Keep cheapest plan per “interesting” output order
CSE 544 - Fall 2016 52
Dynamic Programming Algo.
• Step 3: Generate all three-relation plans
– For each each two-relation plan from step 2
– Consider that plan as outer relation
– Consider every other relation as inner relation
– Compute cost for each plan
– Keep cheapest plan per “interesting” output order
• Steps 4 through n: repeat until plan contains all the
relations in the query
CSE 544 - Fall 2016 53
Commercial Query Optimizers
DB2, Informix, Microsoft SQL Server, Oracle 8
• Inspired by System R
– Left-deep plans and dynamic programming
– Cost-based optimization (CPU and IO)
• Go beyond System R style of optimization
– Also consider right-deep and bushy plans (e.g., Oracle and DB2)
– Variety of additional strategies for generating plans (e.g., DB2
and SQL Server)
CSE 544 - Fall 2016 54
Other Query Optimizers
• Randomized plan generation
– Genetic algorithm
– PostgreSQL uses it for queries with many joins
• Rule-based
– Extensible collection of rules
– Rule = Algebraic law with a direction
– Algorithm for firing these rules
• Generate many alternative plans, in some order
• Prune by cost
– Startburst (later DB2) and Volcano (later SQL Server)
CSE 544 - Fall 2016 55