CSE544
Data Management
Lectures 9
Query Optimization (Part 1)
CSEP 544 - Spring 2021 1
Announcements
• HW2 is due tomorrow!
• HW3 will be posted on Wednesday
• Review 5 (How good?) due Wednesday
• Mini-project guidelines posted
CSEP 544 - Spring 2021 2
Query Optimization Motivation
SQL query
Declarative query
Recall physical and
Parse & Rewrite Query logical data independence
Logical
Query Select Logical Plan
plan
optimization
Select Physical Plan
Physical
plan
Query Execution
Disk 3
Query Optimization
Goal:
• Given a query plan, find a cheaper
(cheapest?) equivalent plan
• Why difficult:
– Need to explore a large number of plans
– Need to estimate the cost of each plan
4
Query Optimization
Three major components:
1. Cardinality and cost estimation
2. Search space
3. Plan enumeration algorithms
CSEP 544 - Spring 2021 5
Query Optimization
Three major components:
1. Cardinality and cost estimation
2. Search space
3. Plan enumeration algorithms
CSEP 544 - Spring 2021 6
Cardinality Estimation
Problem: given statistics on base tables
and a query, estimate size of the answer
Very difficult, because:
• Need to do it very fast
• Need to use very little memory
CSEP 544 - Spring 2021 7
Statistics on Base Data
• Number of tuples (cardinality) T(R)
• Number of physical pages B(R)
• Indexes, number of keys in the index V(R,a)
• Histogram on single attribute (1d)
• Histogram on two attributes (2d)
Computed periodically, often using sampling
CSEP 544 - Spring 2021 8
Assumptions
• Uniformity
• Independence
• Containment of values
• Preservation of values
CSEP 544 - Spring 2021 9
Size Estimation
Selection: size decreases by selectivity factor θ
T(σpred(R)) = θpred * T(R)
CSEP 544 - Spring 2021 10
Size Estimation
Selection: size decreases by selectivity factor θ
T(σpred(R)) = θpred * T(R)
T(R ⨝A=B S) = θA=B * T(R) * T(S)
CSEP 544 - Spring 2021 11
T(σpred(R)) = θpred * T(R)
Selectivity Factors
Uniformity assumption σA=c(R)
Equality:
12
T(σpred(R)) = θpred * T(R)
Selectivity Factors
Uniformity assumption σA=c(R)
Equality:
• θA=c = 1/V(R,A)
13
T(σpred(R)) = θpred * T(R)
Selectivity Factors
Uniformity assumption σA=c(R)
Equality:
• θA=c = 1/V(R,A)
σc1<A<c2(R)
Range:
• θc1<A<c2 = (c2 – c1)/(max(R,A) - min(R,A))
14
T(σpred(R)) = θpred * T(R)
Selectivity Factors
Uniformity assumption σA=c(R)
Equality:
• θA=c = 1/V(R,A)
σc1<A<c2(R)
Range:
• θc1<A<c2 = (c2 – c1)/(max(R,A) - min(R,A))
Conjunction σA=c and B=d(R)
T(σpred(R)) = θpred * T(R)
Selectivity Factors
Uniformity assumption σA=c(R)
Equality:
• θA=c = 1/V(R,A)
σc1<A<c2(R)
Range:
• θc1<A<c2 = (c2 – c1)/(max(R,A) - min(R,A))
Conjunction σA=c and B=d(R)
Independence assumption
• θpred1 and pred2 = θpred1*θpred2 = 1/V(R,A) * 1/V(R,B)
T(R ⨝A=B S) = θA=B * T(R) * T(S)
Selectivity Factors
R ⋈R.A=S.B S
Join
CSEP 544 - Spring 2021 17
T(R ⨝A=B S) = θA=B * T(R) * T(S)
Selectivity Factors
R ⋈R.A=S.B S
Join
• θ R.A=S.B = 1/ ( MAX( V(R,A), V(S,B))
Why? Will explain next...
CSEP 544 - Spring 2021 18
T(R ⨝A=B S) = θA=B * T(R) * T(S)
Selectivity Factors
R ⋈R.A=S.B S
Containment of values: if V(R,A) ≤ V(S,B), then
the set of A values of R is included in the set of
B values of S
• Note: this indeed holds when A is a foreign
key in R, and B is a key in S
CSEP 544 - Spring 2021 19
T(R ⨝A=B S) = θA=B * T(R) * T(S)
Selectivity Factors
R ⋈R.A=S.B S
Assume V(R,A) ≤ V(S,B)
• Tuple t in R joins with T(S)/V(S,B) tuples in S
CSEP 544 - Spring 2021 20
T(R ⨝A=B S) = θA=B * T(R) * T(S)
Selectivity Factors
R ⋈R.A=S.B S
Assume V(R,A) ≤ V(S,B)
• Tuple t in R joins with T(S)/V(S,B) tuples in S
• Hence T(R ⨝A=B S) = T(R) T(S) / V(S,B)
CSEP 544 - Spring 2021 21
T(R ⨝A=B S) = θA=B * T(R) * T(S)
Selectivity Factors
R ⋈R.A=S.B S
Assume V(R,A) ≤ V(S,B)
• Tuple t in R joins with T(S)/V(S,B) tuples in S
• Hence T(R ⨝A=B S) = T(R) T(S) / V(S,B)
In general:
• T(R ⨝A=B S) = T(R) T(S) / max(V(R,A),V(S,B))
• θ R.A=S.B = 1/ ( max( V(R,A), V(S,B))
CSEP 544 - Spring 2021 22
Final Assumption
Preservation of values:
For any other attribute C:
• V(R ⨝A=B S, C) = V(R, C) or
• V(R ⨝A=B S, C) = V(S, C)
• This is needed higher up in the plan
CSEP 544 - Spring 2021 23
Computing the Cost of a Plan
• Estimate cardinalities bottom-up
• Estimate cost by using estimated cardinalities
• Examples next...
CSEP 544 - Spring 2021 24
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)
Logical Query Plan 1
πsname
SELECT sname
FROM Supplier x, Supply y
WHERE [Link] = [Link]
and [Link] = 2
σpno=2∧scity=‘Seattle’∧sstate=‘WA’ and [Link] = ‘Seattle’
and [Link] = ‘WA’
sid = sid
Supply Supplier
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)
Logical Query Plan 1
πsname
SELECT sname
Estimated FROM Supplier x, Supply y
(why?) WHERE [Link] = [Link]
and [Link] = 2
σpno=2∧scity=‘Seattle’∧sstate=‘WA’ and [Link] = ‘Seattle’
and [Link] = ‘WA’
T = 10000
sid = sid
Supply Supplier
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)
Logical Query Plan 1
πsname
SELECT sname
Estimated FROM Supplier x, Supply y
(why?) WHERE [Link] = [Link]
and [Link] = 2
σpno=2∧scity=‘Seattle’∧sstate=‘WA’ and [Link] = ‘Seattle’
and [Link] = ‘WA’
T = 10000
Because key / foreign-key
sid = sid
Supply Supplier
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)
Logical Query Plan 1
πsname
SELECT sname
Estimated FROM Supplier x, Supply y
(why?) WHERE [Link] = [Link]
and [Link] = 2
σpno=2∧scity=‘Seattle’∧sstate=‘WA’ and [Link] = ‘Seattle’
and [Link] = ‘WA’
T = 10000
Because key / foreign-key
sid = sid
Also: θ = 1/max(V(Supply,sid)*V(Supplier,sid) = 1/V(Supplier,sid)
Supply Supplier
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)
Estimated
(why?) Logical Query Plan 1
πsname
SELECT sname
T <1 FROM Supplier x, Supply y
WHERE [Link] = [Link]
and [Link] = 2
σpno=2∧scity=‘Seattle’∧sstate=‘WA’ and [Link] = ‘Seattle’
and [Link] = ‘WA’
T = 10000
sid = sid
Supply Supplier
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)
Logical Query Plan 2
πsname
SELECT sname
FROM Supplier x, Supply y
WHERE [Link] = [Link]
and [Link] = 2
and [Link] = ‘Seattle’
and [Link] = ‘WA’
sid = sid
σpno=2 σscity=‘Seattle’∧sstate=‘WA’
Supply Supplier
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)
Logical Query Plan 2
πsname
SELECT sname
FROM Supplier x, Supply y
WHERE [Link] = [Link]
and [Link] = 2
and [Link] = ‘Seattle’
and [Link] = ‘WA’
T=4 sid = sid T= 5
σpno=2 σscity=‘Seattle’∧sstate=‘WA’
Supply Supplier
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)
Logical Query Plan 2
πsname
SELECT sname
FROM Supplier x, Supply y
WHERE [Link] = [Link]
and [Link] = 2
and [Link] = ‘Seattle’
and [Link] = ‘WA’
T=4 sid = sid T= 5
Very wrong!
Why?
σpno=2 σscity=‘Seattle’∧sstate=‘WA’
Supply Supplier
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)
Logical Query Plan 2
πsname
SELECT sname
FROM Supplier x, Supply y
T=4 WHERE [Link] = [Link]
and [Link] = 2
and [Link] = ‘Seattle’
and [Link] = ‘WA’
T=4 sid = sid T= 5
Very wrong!
Why?
σpno=2 σscity=‘Seattle’∧sstate=‘WA’
Supply Supplier
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)
Logical Query Plan 2
πsname
Different
estimate L SELECT sname
FROM Supplier x, Supply y
T=4 WHERE [Link] = [Link]
and [Link] = 2
and [Link] = ‘Seattle’
and [Link] = ‘WA’
T=4 sid = sid T= 5
Very wrong!
Why?
σpno=2 σscity=‘Seattle’∧sstate=‘WA’
Supply Supplier
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)
Physical Plan 1
πsname
T <1
σpno=2∧scity=‘Seattle’∧sstate=‘WA’
T = 10000
Total cost: 100/10 * 100 = 1000
sid = sid
Block nested loop join
Scan
Supply Scan
Supplier
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)
Physical Plan 1
πsname
T <1
σpno=2∧scity=‘Seattle’∧sstate=‘WA’
T = 10000
Total cost: 100+100*100/10 = 1100
sid = sid
Block nested loop join
Scan
Supply Scan
Supplier
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)
Physical Plan 2
πsname
Cost of Supply(pno) = 4
T=4 Cost of Supplier(scity) = 50
Total cost: 54
T= 5
T=4 sid = sid
Main memory join
σsstate=‘WA’
Unclustered σpno=2 T= 50
index lookup σscity=‘Seattle’ Unclustered
Supply(pno) index lookup
Supply Supplier Supplier(scity)
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)
Physical Plan 2
πsname
Cost of Supply(pno) = 4
T=4 Cost of Supplier(scity) = 50
Total cost: 54
T= 5
T=4 sid = sid
Main memory join
σsstate=‘WA’
Unclustered σpno=2 T= 50
index lookup σscity=‘Seattle’ Unclustered
Supply(pno) index lookup
Supply Supplier Supplier(scity)
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)
Physical Plan 2
πsname
Cost of Supply(pno) = 4
T=4 Cost of Supplier(scity) = 50
Total cost: 54
T= 5
T=4 sid = sid
Main memory join
σsstate=‘WA’
Unclustered σpno=2 T= 50
index lookup σscity=‘Seattle’ Unclustered
Supply(pno) index lookup
Supply Supplier Supplier(scity)
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)
Physical Plan 3
πsname
T=4
σscity=‘Seattle’∧sstate=‘WA’ Cost of Supply(pno) = 4
Cost of Index join = 4
Total cost: 8
T=4 sid = sid
Clustered
Index join
Unclustered σpno=2
index lookup
Supply(pno)
Supply Supplier
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)
Physical Plan 3
πsname
T=4
σscity=‘Seattle’∧sstate=‘WA’ Cost of Supply(pno) = 4
Cost of Index join = 4
Total cost: 8
T=4 sid = sid
Clustered
Index join
Unclustered σpno=2
index lookup
Supply(pno)
Supply Supplier
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)
Physical Plan 3
πsname
T=4
σscity=‘Seattle’∧sstate=‘WA’ Cost of Supply(pno) = 4
Cost of Index join = 4
Total cost: 8
T=4 sid = sid
Clustered
Index join
Unclustered σpno=2
index lookup
Supply(pno)
Supply Supplier
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Discussion
• We considered only IO cost;
real systems need to consider IO+CPU
• Each system has its own hacks
• We assumed that all index pages were in
memory: sometimes we need to add the cost
of fetching index pages from disk
CSEP 544 - Spring 2021 43
Histograms
• T(R), V(R,A) too coarse
• Histogram: separate stats per bucket
• In each bucket store:
– T(bucket)
– V(bucket,A) – optional
CSEP 544 - Spring 2021 44
Histograms
Employee(ssn, name, age)
T(Employee) = 25000, V(Empolyee, age) = 50
σage=48(Empolyee) = ?
Histograms
Employee(ssn, name, age)
T(Employee) = 25000, V(Empolyee, age) = 50
σage=48(Empolyee) = ?
Estimate: T(Employee) / V(Employee,age) = 500
Histograms
Employee(ssn, name, age)
T(Employee) = 25000, V(Empolyee, age) = 50
σage=48(Empolyee) = ?
Estimate: T(Employee) / V(Employee,age) = 500
Age: 0..20 20..29 30-39 40-49 50-59 > 60
T= 200 800 5000 12000 6500 500
Histograms
Employee(ssn, name, age)
T(Employee) = 25000, V(Empolyee, age) = 50
σage=48(Empolyee) = ?
Estimate: T(Employee) / V(Employee,age) = 500
Age: 0..20 20..29 30-39 40-49 50-59 > 60
T= 200 800 5000 12000 6500 500
Assume V = 10
Histograms
Employee(ssn, name, age)
T(Employee) = 25000, V(Empolyee, age) = 50
σage=48(Empolyee) = ?
Estimate: T(Employee) / V(Employee,age) = 500
Age: 0..20 20..29 30-39 40-49 50-59 > 60
T= 200 800 5000 12000 6500 500
Estimate: 12000/10 = 1200 Assume V = 10
Histograms
Employee(ssn, name, age)
T(Employee) = 25000, V(Empolyee, age) = 50
σage=48(Empolyee) = ?
Estimate: T(Employee) / V(Employee,age) = 500
Age: 0..20 20..29 30-39 40-49 50-59 > 60
T= 200 800 5000 12000 6500 500
V= 3 10 7 6 5 4
Estimate: 12000/10 = 1200
Histograms
Employee(ssn, name, age)
T(Employee) = 25000, V(Empolyee, age) = 50
σage=48(Empolyee) = ?
Estimate: T(Employee) / V(Employee,age) = 500
Age: 0..20 20..29 30-39 40-49 50-59 > 60
T= 200 800 5000 12000 6500 500
V= 3 10 7 6 5 4
Estimate: 12000/10 = 1200 12000/6 = 2000
Types of Histograms
• Eq-Width
• Eq-Depth
• Compressed: store outliers separately
• V-Optimal histograms
CSEP 544 - Spring 2021 52
Employee(ssn, name, age)
Histograms
Eq-width:
Age: 0..20 20..29 30-39 40-49 50-59 > 60
Tuples 200 800 5000 12000 6500 500
Eq-depth:
Age: 0..32 33..41 42-46 47-52 53-58 > 60
Tuples 1800 2000 2100 2200 1900 1800
Compressed: store separately highly frequent values: (48,1900)
V-Optimal Histograms
• Error:
0
# 𝜎*,! 𝑅 − 𝑒𝑠𝑡-'./ 𝜎*,! 𝑅
!∈#$%&'((*)
• Bucket boundaries = argminHist(Error)
• Dynamic programming
• Modern databases systems use V-
optimal histograms or some variations
CSEP 544 - Spring 2021 54
Discussion
• Cardinality estimation = still unsolved
• Histograms:
– Small number of buckets (why?)
– Updated only periodically (why?)
– No 2d histograms (except db2) why?
• Samples:
– Fail for low selectivity estimates, joins
• Cross-join correlation – still unsolved
55