100%(2)100% found this document useful (2 votes) 9K views294 pagesA First Course in Database Systems (3rd Edition)
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content,
claim it here.
Available Formats
Download as PDF or read online on Scribd
TE a ee ee 3
“aRETaRUHE
|
|
CSUR - SBSH
|
A First Course in
Database Systems
:
(Third Edition)
Qaere wweEnglish reprint edition copyright © 2008 by Pearson Education Asia Limited
‘and China Machine Press,
Original English language title: A First Cours
Ei IBN 978.0-13-502176-7) by Jeffrey D.
‘Copyright © 2008, 2002, 1997 Pearson Education, Ine
All rights reserved,
Published by arrangement with the original publisher, Pearson Education, In.,
publishing as Prentice Hall
For sale and distribution in the People’s Republic of China exclusively (except
mng Kong SAR and Macau SAR).
in Database Systems, Third
wan and Jennifer Widom,
FIM) HHT.
Ac {5H Mii Pearson Education
RRA.
RUA. BALE.
TREMP RIERS HF
APHIS. BF. 01-2008-3272
ACHES TA) BOLD D PRE, JE
aK
BUT
8 (Ullman, J. D.),
GFL, 2008.8
st Course in Database Systems, Third Ealition
ISBN 978-7-111-24733.3
1 Di Oph. ede Ee Hea IW. T3113
+ IR 37SENBE
247333
eth: 45.00%
FURGARTS, MA TL T. AOT, A RL TAB
ZRLLNGIALR, (010) 68326294
ARS BIS
KEM MUME, TIRES AE DE Re RE,
ART BITTEN, IEEE
AEH. (HH LEN EA RAAT SAF de AEH BOTS
PS AE Be. ORE RI TORSTEN, a TEAR
SME RM. LAA EE, ROMA T AE A
iit
AEE, (EER ALE AANI HED Fale
PEMA A AER AL
J AL RAR TR A ee A TA PES TE
PRPS OER A RRO ind SRA TSAR, ST RG FREE EBL,
PERLE FRADE Be BCH 5 5 EF BAPE EZ Me
BBR. 1B SRL ERD SP HG LB A AG tL LR
Fae PRA He a PD, i SS eR IEA HR — RE
He
BLT Ae HRA HER a EA HE TR”
LOOREE TERE. SERA HRA T (RACAL CTE Takk | EGUERL SNL Bot
bk. Sit SMASH, FVGPearson, McGraw-Hill, Elsevier,
MIT, John Wiley & Sons, Cengage tJ % #4 time 2s altar 7 RAE
AY Ee PE MA A ATT A A i tak HH Andrew S.
Tanenbaum, Bjarne Stroustrup, Brain W. Kernighan, Dennis Ritchie,
Jim Gray, Afred V. Aho, John E. Hoperoft, Jeffrey D. Ullman,
Abraham Silberschatz, William Stallings, Donald E. Knuth, John L,
Hennessy, Larry L. Peterson Aiti # RASH Se feta. EAE
BLEEP, CREA, IRR. KAY
A, IER TMA BAA A“UE PL BLES A” AY A fe 9 TD hE RET BD BE
ATS Te ALS, ee ALE T
FUERA A «i RASS PE A tA Sek IE HR CE A 4
ECR WIN BR EHES ed, “URL RES” BE
THART AH, RSDRERE tHe TREO, HRS
RRMA ERS MBA. FORCE “AARNE” HE
‘i th He EER
BURMITE RS . SBMAH OM HRI. PATS
Ai, GMETAD RACAL AS i TAI E. Bi Pe LAE Se
AG AE BY AN FE BE ADA A RA, BATE RL
EEL 9G LE A — AB, BLATT FE
iS Ak — A BRA EY, A
MELB THRE, BUNA
hzedu@ hzbook.com Uy
10) 68995264
Fee we AALS oun
tea4@55. 100037 SAH Bats
Preface
At Stanford, we are on the quarter system, and as a re
database instruction is divided into two courses. The firs
for students who will use database systems but not
is prerequisite for C8245, w
to DBMS implementation. Students wishing to go further in the database field
then take CSD45(opecal topics), (S046 (DBMS implementation project), and
ied for CS145, and Database System Impl
"S346. Because many schools are on the semester system
wo kinds of database instructi rductory course,
here was a need to prod asa single volume,
Database Systems: The Co
point somewhat. Today, there are two imp
relational and semistructured (XML). We have decide
object-oriented databases, except in the contexts of
systems,
herefore to downplay
and object-relational
What’s New in the Third Edition
in Chapters
addition to the
moved to Chapter 4 a shorter version of the material on ODL, treatvi PREFACE,
2 design language for relation
treatment of ODL and OQL is a
‘The material on functional and multivalued dependencies has been mod-
ified and remains in Chapter 3. We have changed our viewpoint, 50 that a
fenctional dependency i assumed to have a set of attributes om the right. We
normal form to include the 3NF synthesis algorithm and to make clear what
the tradeoff between 3NF and BCNF is.
Chapter 5 contains the coverage of relational algebra fi
joined by the treatment of Datalog from
recursion in Datalog
chapter, number 8, and hhas been augmented with a discs
important new topics, including materialized views, and automatic selection of
Indexes.
‘The new Chapter 9 is based on the old Chapter 8 (embedded SQL). It
is introduced by a new section on 3-tiered architecture. It also includes an
expanded discussion of JDBC and new coverage of PHP.
Chapter 10 collects a number of advanced SQL topics. The discussion of
authorization from the old Chapter 8 has been moved here, as has the discussion
the old Chapter 10. Most of the chapter is devoted to the
(from the old Chapter 4) and object-relational features
wapters 11 and 12 cover XML and systems based on XML. Ex-
ial at the end of the old Chapter 4, which has bi
ter 12 is devoted to programming, and
and XSLT.
ludes sections on XPath, XQuery,
Use of the Book
‘There is adequate material in this volume for a one-semester course on database
‘modeling and programming. For a one-quarter course, you will probably have
to omit some of the topics. We regard Chapters 2-7 as the core of the course.
‘The remaining five chapters contain material from which it is safe to select at
will, although we believe that every student should get some exposure to the
issues of embedding SQL in standard host languages from one of the sections
in Chapter 9.
PREFACE vii
If, as we do in CS145, you give students a substan fabase-application
‘material such as dependencies, although students need normalization for design,
Prerequisites
‘We have used the book at the
undergraduates and beginnin
oftware systems, software engineer
Exercises
‘The book contains extensive exercises, with some for almost every section, We
indicate harder exercises or parts of exercises with an exclamation point. "The
hhardest exercises have a double exclamation pol
Gradiance On-Line Exercises
‘There is an accompanying set of on-line homeworks using a technology devel-
corks to their
us class” that
get a perfect score.
ce package for the book includes programming ex-
ed for correctness, and
if queries that only respond correctly
iples is not sufficient to got credit for the problem.
Gradiance service can be purchased at
http://www. prenhall..con/goal.
ir classes should contact their
Instructors who want to use the system in
Prentice-Hall representative.vill PREFACE
Support on the World Wide Web
‘The book's home page is
http: //wwy-db. stanford, edu/"ullnan/fedb.htal
‘You will find errata as we learn of them, and backup materials. We are making
available the notes for each offer
‘works, projects and exarns. We
the second edition that have
Mapping the Second Edition to the Third
Here is a table giving the correspondence between old sections and new.
Old_| New || Old_| New Old_| New
Vi [ir fiz fiz 22 [a2
23 [43 [24 [44 33 | 46
34 [31 35 | 32 41 | Web
42 [49 [4s | 49 46 fit
a7 |u2|sa | 22 sa [52
38 |25 jes | oa 64 | 64
65/65 |e | 23 r2 | 72
73 [73 |ira | ras 83 [92
84 jos jas [oo 9.1 | Web
8.2. | web | 9.3 | Web 10.1 | 53
102 {54 | 10.3 | web
Acknowledgements
‘We wonld like o thank Donald Kossmann for helpful discussions, especially con-
cerning XML and its associated programming systems. Also, Bobbie Cochrane
assisted us in understanding trigger semantics for a earlier edi
A large number of people have helped us, either with the development of this
book or its predecessors, or by contacting us with errata in the books and/or
other Web-based materials, It is our pleasure to acknowledge them all here.
H. Adamski, Brad Adelberg, Gleb Ashimo, Don-
Babeock, Bruce Baker, Y
Phillip
Butler, Karen Butler, Mike Carey,
Also Per Christensen, Ed Surajit Chi
Chirkova, Nitin Chopra, Lewis Church, Jr. B
Tom Di
chika, Navier Faz, Greg Fichtenholtz, Bart
Fisher, Simon Frettloeh, Jas!
PREFACE 7
Meredith
Gyn:
Gerard
Luis Gravano,
itzman,
Also Sajid HussaiTable of Contents
‘The Worlds of Database Systems
1.1 The Evolution of Database Systems
Early Database Managenent Stes
11.3 Smaller and Smaller Systems -
1.14 Bigger and Bigger Systems
1.1.5 Information Integration
1.2 Overview of a Databas
121 Data-Defini
1.2.3 Storage and Buffer Management
1.24 Transaction Processing
Relational Database Modeling
‘The Relational Model of Data
2.1 An Overview of Data Models
2.1.1 What is a Data Model?
2.14 The Semistructured Model in Brief .
2.15 Other Data Models
22.1 Attributes
222 Schemas
223° Tuples
224 Domai
2.2.5 Equivalent Representations of a Relation
TABLE OF CONTENTS
2.2.6 Relation Instances
22.7 Keys of Relations . .
228 An Example Database Schema
22.9 Exercises for Section 2.2
23 Defining a Relation Schema in SQL
23.1 Relations in SQL
23,7 Exercises for Section 23. .
24. An Algebraic Query Language
2.4.1 Why Do We Need a Special Query Language? -
24.2 What is an Algebra’
24.3 Overview of Relational Algebra
244 ons on Relations
245 7
246
2A jan Product
24.8 Natural Joins
249
24.12 Relationships Among Operations
24.19 A Liner Netaton for Algsrale Exprendone
24.14 Exercises for Section 2.4
25 Constraints on Relations .
25.1 Relational Algebra.
25.2 Referential Integrity Constraints
25.3 Key Constraints.
254 Additional Constraint Examples.
25.5 Exercises for Section 2.5,
26 Summary of Chapter 2
2.7 References for Chapter 2
Design Theory for Relational Databases
3.1 Functional Dependencies
3.11 Definition of Functional Dependency «
3.1.2 Keys of Rel
Superkeys . .
3.2. Rules About Functional Dependencies .
3.2.1 Reasoni
3.2.2 The Spl
1g/Combining Rule
‘a Constraint Vane
About Functional Dependenciesxii TABLE OF CONTENTS ‘TABLE OF CONTENTS
323 ™ 4 High-Level Database Models
324 Z 4.1 The Entity/Relationship Model
411
325 7 mee
a a Relationships
327 80 Entity-Relationship Diagrams
328 al Instances of an E/R Diagram
329 83 Binary E/R Rel
3.3 Design of Relational Database Schemas 85
33.1 Anomalies eee a6)
3.3.2 Decomposing Relations 86
3.3.3 Boyce-Codd Normal Form sees 88
3.34 Decomposition into BCNF.. . . ‘ sees 89
3.3.5 Exercises for Section 3.3 cee
3.4 Decomposition: The Good, Bad, and Ugly 93
34.1 Recovering Information from a Decomposition 4
34.2 The Chase Test for Lossless Join cee 96
343 Why the Chase Works 7 . 99
344 Dependency Preservation aes 00)
34.5 Exercises for Section 34... . 2 102
3.5 Third Normal Form... 102
35.1 Definition of Third Normal Form 102
35.2 The Synthesis Algorithm for 3NF Schemas 103
3.5.3 Why the 3NF Synthesis Algorithm Works we 14
354 = 105
36 = 105;
nate Independence and Its Consequent. Redundancy 106
of Multivalued Dependencies... . 107
1g About Multivalued Dependencies 108
Fourth Normal Form :
38
391
xiv TABLE OF CONTENTS
4.7.1 UML Classes
472
473
ara §
ars
476 :
277) Aegregation and Conposions
4.7.8 Bxercises for
4.8 From UML Diagram
48.1 UML-to-Relatic
482 From UML Si
4.8.3 From Aggregations and Comp
484 The UML Analog of Week Entity Sts
48.5 Exercises for
Object Definition Li
49.1
49.2 Attributes in ODL
4.9.3 Relationships in ODL
4.9.4 Inverse Relationships
49.5 Multiplicity of Relationships
496
49.7 Subclasses in ODL
4.98 Declaring Keys in ODL
4.9.9 Bxercises for Section 4.9
4.10 From ODL Designs to Relational Designs.
4.10.1 From ODL Classes to Relations . .
4.10.2 Complex Attributes in Classes |...
4.10.3 Representing Set-Valued Attributes |.
4.10.4 Representing Other Type Constructors
10.5. Representing ODL Relationships
4.10.6 Exercises for Section 4.10
4.11 Summary of Chapter 4. .
4.12 References for Chapter 4
4
Relational Database Programming
5 Algebraic and Logical Query Languages
5.1 Relational Operations on Bags
51 Why Bags? 2... .
5.12 Union, Intersection, and Difference of Bags
5.1.3 Projection of Bags
5.14 Selection on Bags
5.15. Product of Bags
5.16 Joins of Bags
172
173
173
175
205
207
209
210
210
TABLE OF CONTENTS
5.17 Exercises for Section 5.1 .
Extended Operators of Relational Algebra
5.2.1 Duplicate Elimination
5.
5.2.2 Aggregation Operators
523 Grouping
5.24 The Grouping Operator . .
5.2.5 Extending the Projection Operator
5.26 The Sorting Operator
Outerjoins -
5.34 Meaning of Datalog Rules
5.3.5 Extensional and Intensional Predi
5.36 Datalog Rules Applied to Bags
5.3.7 Exercises for Section 5.3
5A Relational Algebra and Datalog
5A2 Projection
54.3. Selection
544 Product,
545 Joins
5.4.6 Simulating Multiple Operations with Datalog,
5417 Comparison Between Datlog and Relational Algebra
5A8 Exercises for Section 5.4
5.5 Summary of Chapter 5
5.6 References for Chapter 5
base Language SQL
Nall Values and Comperisons Involving MULL
‘The Truth:
Ordering the Output
Exercises for Section 6.1
6.2 Queries Involving More Than One Relation
6.2.2 Disambiguating Attributes
62.3 Tuple Variablesxvi TABLE OF CONTENTS
624 Interpret
625
mn Queries
Difference of Queries
63.1
63.2
633
634
635
636
637
638
63.9
64. Full-Rel
641
642
643° Grouping and Aggregation in SQL
G44 Agsregation Operators
645
646
647
GAS Exercises for Si
6.5 Database Modifications
6.5.1 Insertion.
652 Deletion
653. Updates
6.5.4 Bxercises for Section 6.5
6.6 ‘Transaction
66.1
662
66.3 Transactions
6.64 Read-Only Transactions
6.65 Dirty Reads . :
6.6.6 Other Is Levels
6.6.7 Exercises for Section 6.6
6.7 Summary of Chapter 6
68 References for Chapter 6
and Differences,
Null
Constraints and Triggers
7.1. Keys and Foreign Keys i
7.1.1 Declaring Foreign-Key Constraints
7.1.2 Maintaining Referential Integrity
7.1.3 Deferred Checking of Constraints
TA Exercises for Section 7.1
7.2. Constraints on Attributes and Tuples
262
265
267
268
269
270
mm
273
274
275
276
27
279
281
281
282
283
284
285
287
288
289
201
291
292
204
295
296
296
298
299
300
302
TABLE OF CONTENTS
73
74
82
83
84
85
7.38 Exercises for Section 7.3,
Assertions
TAL Creating Assertions
TA2_ Using Assertions
743. Exercises for Section 7.4
‘Triggers a
75.1 Triggers in SQL. -
7.5.2 The Options for Trigger Design
7.5.3 Exercises for Section 7.5
Summary of Chapter 7
References for Chapter 7
and Indexes
Virtual Views
BLA Exercises for Section 8.1
Modifying Views
8.21 View Removal
8.2.2 Updatable Views
8.2.3 Instead-Of Triggers on Views
8.2.4 Exercises for Section 8.2
Indexes in SQL
83.1 Motivation for Indexes
83.2 Declaring Indexes
83.3 Exercises for Section 8.3
Selection of Indexes. -
84.1 A Simple Cost Model
34.2 Some Useful Indexes
843 Caleulat
844 Automatic Selection of Indexes to Create
ceeises for Section 84
xvii
319
320
321
323
823
325
325
326
327
328
328
329
30
332
832
334
337
39)
341
341
341
343
343
344
3a
345
345
37
349
350
351
352
352
- 352
2353
355
387
359
359
360
362TABLE OF CONTENTS xix
sli TABLE OF CONTENTS
JDBC aa
m tic Creation of Materialized Views 364 °8 9.6.1 Introduction to JDBC bert
ises for Section 8.5 - 365, 9.6.2 Creating Statements in JDBC . 413.
8.6 Summary of Chapter 8. . . 366 9.6.3 Cursor Operations in JDBC 415
8.7 References for Chapter 8 367, 9.64 Parameter Passing - - fae 416
for Section ria
9 SQU in a Server Environment 360 oo ee +s
9.1 The Three-Tier Architecture = 369 "97.1 PHP Basics aie
1 The Web-Server Tier . . ...370 9.7.2 Arrays - - a8
1.2. The Application Tier 371 9.73 The PEAR DB Library ud
3 The Database Tier i 372 9.74 Creating a Database Connection Using DB 419
92 The SQL Environment . . . 372 Executing SQL Statements 419
921 Environments : 373 ‘Cursor Operations in PHP : : 420
9.2.2 Schemas i 374 Dynamic SQL in PHP ie
9.23 Catalogs 375 9. 78 Exercises for Section 9.7 Lead
824 Clans and Servers in the SOL Envionment 11 375 9.8 Summary of Chapter 9 =
2.5 Connections: ter bead
926. Senora : an ot eeoreer
9.2.7 Modules . 2 878 ‘Advanced Topics in Relational Databases 425,
83 The SQL/Host Language Interface 373 ae peer e gener al 25,
9.3.1 The Impedance Mismatch Problem 380 10. Pr rd
9.3.2 Connecting SQ) lost Language 380 10.1.2 Cree oe
9.3.3 The DECLARE Section - ‘381 10.1.3. The Privilege-Checking Process 428
9.34 Using Shared Variables. . + 382 = fai
9.3.5 Single-Row Select Statements 383 = pee ie
93.6 Cursors Privileg
937 Moen, by Corso 3s Execs or Seton 101 : 7
938 Protecting Against Concurveat Updates 387 squ on
939 Dynamic SQL 338 ing Recursive elations in SQL 437
9.3.10 Exercises for Section 9.3 390, matic ‘Expressions in Recursive SQL 440
94° Stored Procedures 391 ines for Section 10.2 me
941 is and Procedures 391 ie
9.4.2 Some Forms 392, ae
94.3 Branc 394 rae
944 Queri - 3 10.3.3 References - : on
243. Loopetn PSM ee Toad Object Oriented Versus Object Relational 40
946 For-Loops 398, at
94.7 Exceptions in PSM. . 400 cr
402 os
482
405 or
. 407 10.4.5 References bia
9.5.3. Fetching Data From a Query Result 408, 10.4.6 Creating Object ID's for Tables . 455
95.4 Passing Parameters to Queries 10 10A.7 Exercises for Section 10.4 : 7
9.5.5 Exercises for Section 9.5 412Chapter 1
The Worlds of Database
Systems2 CHAPTER 1, THE WORLDS OF DATABASE SYSTEMS
query the data (a “query” is database lingo for
lata) and modify sing an appropriate
‘a query language or data-manipulation language.
language, often
3. Support the storage of very large amounts of data — many terabytes or
over a long period of Jing efficient access to the data
1.1.1 Early Database Management Systems
‘The first commercial database management systems appeared
‘These systems evolved from filesystems, which provide some of
file systems store data over a long period of time, an
lata. However, file systems do not
snerally guarantee that
port efficient access
1. Banking systems: maintaining accounts and making sure that system
failures do not cause money to disappear.
systems: these, like banking systems, require assurance
be lost, and they must accept very large volumes of
small actions by customers,
3. Corporate record keeping: employment and tax records, inventories, sales
records, and a great variety of other types of information, much of it
‘The early DBMS's required the programmer to visualize data much as it
was stored. These database systems used several different data models for
11. THE EVOLUTION OF DATABASE SYSTEMS 3
n a database, chief among them
based model and the graph-based “network” model
sd In the late 1960's through a report of CODASYL
ment, through
able effort. needed to
te such programs, even for very simple q
1.1.2. Relational Database Systems
ng a famous paper written by Ted Codd in 1970,? database systems
changed si Codd proposed that database systems should present
the user with a view of data organized as tables called relations. Behind ¢
scenes, there might be a complex data structure that allowed rapid response
to a variety of queries. But, unlike the programmers for earlier database sys-
toms, the programmer of a relational system would not be concerned with the
storage structure, Queries could be expressed in a very high-level language,
which greatly increased the efficiency of database programmers. We shall cover
the relational model of database systems throughout most of this book, SQL
‘Structured Query Language”), the most important query language based on
relational model, is covered extensively.
By 1990, relational database systems were the norm. Yet the database field
continues to evolve, and new issues and approaches to the manager
surface regularly
Some of the largest databases are organized rather differently from those using,
relational methodology. In the balance of this section, we shall consider soi
of the modern trends in database systems,
1.1.3 Smaller and Smaller Systems
DBMS's were large, expensive software systems run
sm. Today, hundreds of gigabytes fit on a s
in a DBMS on a personal computer. TI4 CHAPTER 1. THE WORLDS OF DATABASE SYSTEMS
serve as a database, and the methods of querying and manipulating them are
different from those relational systems.
1.14 Bigger and Bigger Systems
On the other hand, a gigabyte is not that much data any more. Corporate
databases routinely store terabytes (10? bytes). Yet there are many databases
that store petabytes (10! bytes) of data and serve it all to users. Some impor-
tant examples:
1, Google holds petabytes of data gleaned fro
data is not held in a traditional DBMS, but
optimized for search-engine queries
crawl of the Web. This
specialized structures
Satellites send down petabytes of information for storage in specialized
systems,
3. A picture is actually worth way more than a thousand words, You can
store 1000 words in five or six thousand bytes. Storing a picture typi-
cally takes much more space. Repositories such as Flickr store mi
of pictures and support search of those pictures. Even a database Ii
Amazon's has millions of pictures of products to serve,
‘And ifstill pictures consume space, movies consume much more. An hour
of video requires atleast a gigabyte, Sites such as YouTube hold hundreds
of thousands, or millions, of movies and make them available easly.
5, Peer-to-peer file-sharing systems use large networks of conventional com-
fers to store and distribute data of various kinds. Although each node
the network may only store a few hundred gigabytes, together the
database they embody is enormous.
1.1.5 Information Integration
To a great extent, the old problem of building and ning databases has
become one of information integration: joining the information contained in
‘many related databases into a whole, For example, a large company has many
divisions. Bach di own database of products
ployee records indepe
‘to mean the same thing or the same
term to mean different things. To make matters worse, the existence of legacy
applications using each of these databases makes it almost impossible to scrap
them, ever.
‘Asa result, it has become necessary with increasing frequency to
tures on top of existing databases, with the goal of integrating
ld struc-
information
1.2. OVERVIEW OF A DATABASE MANAGEMENT SYSTEM 5
distributed among. One popular approach is the creation of data ware-
‘many legacy databases is copied periodically,
with the appropriate translation, to a central database. Another approach is
the implementation of a medi idleware,” whose function is to sup-
pport an integrated model of the data of the various databases, while translating
‘between this model and the actual models used by each database.
1.2 Overview of a Database Management
System
at the top,
to the DBMS:
1, Conventional users and application programs that ask for data or modify
data,
A database administrator: a person or persons responsible for the struc-
ture or schema of the database.
1.2.1 Data-Definition Language Commands
‘The second kind of command is the simpler to process, and we show its trail
beginning at the upper right side of Fig. 1.1. For example, the database admin-
‘or, or DBA, for a university registrar's database might decide that there
‘a student, a course the student
hat course. The DBA might also
red by the DBA, who needs special authority to ex-
cute schema-altering commands, since these can have profound effects on the
database. ‘These schema-altering data-definition language (DDL) commands
are parsed by a DDL processor and passed to the execution engine, which then
‘goes through the index/file/record manager to alt ‘metadata, that
schema information for the database.
, the:
1.2.2 Overview of Query Processing
The great majority of interactions llow the path on the left
side of Fig. 1-1. A user or an ay ates some action, using
‘the data-manipulation language (DML). ‘This command does not affect the
schema of the database, but may affect the content of the database (if theCHAPTER 1. THE WORLDS OF DATABASE SYSTEMS
Bxecuton
reo reuc
Indewiteree=
Figure 1.1; Database management system components
OVERVIEW OF A DATABASE MANAGEMENT SYSTEM 7
‘a modification command) or will extract data from the database (ifthe
‘a query). DML statements are handled by two separate subsystems,
Answering the Query
pptimized by a query compiler. The resul
is the DBMS wi
‘The query is parsed
plan, ot sequence of
passed to the engine. The exe
requests for s s of data, typically
resource manager that knows about dat
tnd sie of records in those files, and index fle, which lp fi
data files quickly.
‘The requests for data are pas
ager’s task is to bring appropriate px
(disk) where it
page or “disk block’ is the unit of transfer between
‘The buffer manager communicates with a storage manager to get data
disk. ‘The storage manager migh
more typically, the DBMS issues
to the buffer manager. The buffer man-
‘Transaction Processing
Queries and other DML actions are grouped into transactions, which are
tion of transactions must be durable, meaning that the effect of any completed
transaction must be preserved even if the system fails in some way right after
completion of the transaction, We divide the transaction processor into two
major parts:
1. A concurrency-control manager, or scheduler, responsible for assuring,
atomicity and isolation of transactions, and
2. A logging and recovery manager, responsible for the durability of trans:
actions,
1.2.8 Storage and Buffer Management
‘The data of a database normally resides in secondary storage; in today’s com-
puter systems “secondary storage” generally means magnetic disk. However, to
perform any useful operation on data, that data must be in main memory. Tt
is the job of the storage manager to control the placement of data on disk and
its movement between disk and main memory.
a simple database storage manager might be nothing more
than the file system of the underlying operating system. However, for efficiencyOVERVIEW OF A DATABASE MANAGEMENT SYSTEM 9
8 CHAPTER 1. THE WORLDS OF DATABASE SYSTEMS
purposes, DBMS's normally control storage on the disk directly, at least under The ACID Properties of Transactions
some circumstances. ‘The storage manager keeps track of the location of files
on the disk and obtains the block or blocks containing a file on request from Properly implemented transactions are commonly said to meet the “ACID
the buffer manager. test,” where:
‘The buffer manager is responsible for par
ory into buffers, which are page-sized regions
ing the available main mem-
10 which disk blocks can be © “AY stands for “atomicity,” the all-or-nothing execution of trans-
all DBMS components that need information from the disk actions,
‘© "T” stands for “isolation,” the fact that each transaction must appear
to be executed as if no other transaction is executing at the same
need include:
1. Data: the contents of the database itself stands for “dur
2. Metadata: the database schema that describes the structure of, and con- ee ee
straints on, the database. ‘has completed.
The remaini 7 stands for “consistency.” That is, all databases
2 fay evr, formation abut recent changes to the dati thee Helped panera lence
data elements (e.g., account balances may not be negative after a trans-
died ean gees er bos eae flo hen rast epee opr te cna
Eee erred poe
components of the database.
Indeses: data structures that support efficient access to the data,
sei icc at once, Thus, the scheduler (concurrency-control manager) must assure
Fi that the indi
1.2.4 Transaction Processing iat the net effect is the same as if the transactions had in
i in their entirety, one-at-a-time. A typical scheduler does
normal to group one or more database operations into a transaction, which
is a unit of work that must be executed atomically and in apparent isolation its work by maintaining locks on certain pieces of the database. These
from other transactions, In addition, a DBMS offers the guarantee of durability locks prevent two transactions from accessing the same piece of data in
‘that the work of a completed transaction will never be lost. The transaction ‘ways that interact badly. Locks are generally stored in a main-memory
lock table, as suggested by Fig, 1.1. The scheduler affects the execution of
1er database operations by forbidding the execution er
locked parts of the database.
‘manager therefore accepts transaction commands from an application, which
tell the transaction manager when tran begin and end, as well as infor-
‘mation about the exps (some may not wish to require
atomicity, for example). The transaction processor performs the following tasks:
locks that the scheduler grants, they can get into a situation where not
ccan proceed because each needs something another transaction has. The
transaction manager has the respons :
-vene and cancel (“roll-
back” or “abort”) one or more transactio the others proceed,
1. Logging: In order to assure durat
logged separately on disk. ‘The log
‘designed to assure that no matter wh
a recovery manager will be
change in the database is
sone of several policies
system failure or “crash” occurs,
the log of changes and restore
the database to some consistent, state. writes
the log in buffers and negotiates with the buffer manager to make sure that
buffers are written to disk (where data can survive a crash) at appropriate
1.2.5 The Query Processor
times. ‘The portion of the DBMS that most affects the performance that the user sees
isthe query proceser Ta Fig. 1 the query broweor represented Dy eo
2. Concurrency contro: Transaction must appear to execu in sl i te gery Ye LL the ery vecented by
But in most systems, there will in truth be many transactions executing.10 CHAPTER 1, THE WORLDS OF DATABASE SYSTEMS
ranslates the query into an internal form called
uence of operations to be performed on
‘a query plan are implementations of
1. The query compiler, which
(a) A query parser, which builds a tree structure from the textual form
of the query.
A query preprocessor, which performs seman
(eg, making sure all relat
query pl
best available sequence of operations on the actual data,
's about the data to decide
fastest. For example, the
The query compiler uses metadata and stat
which sequence of operations is likely to be th
access to data, given values for one or more components of that dat
riake one plan much faster than another.
2. The execution engine, which has the responsi
the steps in the chosen query
‘most of the other component
must get the data
accessing data that is locked, and
all database changes are properly logged.
1.3 Outline of Database-System Studies
We divide the study of databases into five parts. This section is an out
each of these units
‘model is essential for a study of database systems. After ex-
amining the basic concepts, we delve into the theory of relational databases.
‘That study includes functional dependencies, a formal way of stating that one
kind of data is uniquely determined by another. Tt also includes normalization,
the process whereby functional dependencies and other formal dependencies are
used to improve the design of a relational database
‘We also consider high-level design notations, These mechanisms include the
Entity-Relationship (E/R) model, Unified Modeling Language (UML), and Ob-
ject Definition Language (ODL). Their purpose isto allow informal exploration
‘of design issues before we implement the design using a relational DBMS.
13. OUTLINE OF DATABASE-SYSTEM STUDIES ul
Part II: Relational Database Programming
We then take up the matter of how relational databases are queried and modi-
fied. After an introduction to abstract programming languages based on algebra
and logic (Relational Algebra and Datalog, respectively), we turn our atten-
tion to the standard language for relational databases: SQL. We study both
the basics and important special topics, including constraint specifications and
triggers (active database elements), indexes and other structures to enhance
performance, forming SQL into transactions, and security and privacy of data
in SQL.
We also discuss how SQL is used in complete systems. It is typical to
combine SQL with a conventional or host language and to pass data between
the database and the conventional program via SQL calls. We discuss a number
of ways to make this connection, including embedded SQL, Persistent Stored
‘Modules (P:
(DBO), and
Part III: Semistructured Data Modeling and Programming
‘The pervasiveness of the Web has put a premium on the management of hierar-
chically structured data, because the standards for the Web are based on nested,
tagged elements (semistructured data). We
defining notations: Document Type Definitions (DTD) and XML Schema. We
also examine three query languages for XML: XPATH, XQuery, and Extensible
Stylesheet Language Transform (XSLT)
Part IV: Database System Implementation
We begin with a study of storage management: how disk-based storage can be
organized to allow efficient access to data. We explain the commonly used B-
a balanced tree of disk blocks and other specialized schemes for managing
mensional data,
‘We then turn our attention to query processing. There are two patts to
this study. First, we need to learn query execution: the algorithms used to
implement the operations from which queries are built. Since data is typically
fon disk, the algorithms are somewhat different from what one would expect
were they to study the same problems but assuming that data were in main
‘memory. The second step is query compiling. Here, we study how to select an
‘efficient query plan from among all the possible ways in which a given query
can be executed
‘Then, we study transaction processing. There are several threads to foliow.
‘One concerns logging: maintaining reliable records of what the DBMS is doing,
in order to allow recovery in the event of a crash. Another thread is scheduling
controlling the order of events in transactions to assure the ACID properties.
We also consider how to deal with deadlocks, and the modificatio
rithms that are needed when a transaction is distributed over many12 CHAPTER 1. THE WORLDS OF DATABASE SYSTEMS
sites.
Part V: Modern Database System Issues
In this part, we take up a number of the ways in which database-system tech-
nology is relevant beyond the realm of conventional, relational DBMS's. We
‘consider how search engines work, and the specialized data structures that make
their operation possible. We look at information integration, and methodolo-
‘gies for making databases share their data seamlessly. Data mining is a study
that includes a number of interesting and important algorithms for processing
large amounts of data in complex ways. Data-stream systems deal with data
arrives at the system continuously, and whose queries are answered contin=
jously and in a timely fashion. Peer-to-peer systems pri
for management of distributed data held by independent
1.4 References for Chapter 1
in our citations, but rather shall mention only the papers of historical impor-
tance and major secondary sources or useful surveys. A searchable index of
je references from many fields.
directory of many indexes relevan
implementations of database systems
widely known are the System R
shaped the database field ar
‘The 2003 “Lowell report’
database-system research and dire
of this type.
1 most recent in a series of reports on
jons. It also has references to earlier reports
than is covered
‘can find more about the th
here from [2] and [8]
y of database
1. $. Abiteboul et al, “The Lowell database research self-assessment,"
ACM 48:5 (2008 18. http: //research.microsoft .com/"gray
‘Noveli/Love1IDetebaseResearchSelfhesessnont.hta
2. S. Abiteboul, R. Hull, and V, Vianu, Foundations of Databases, Addison-
‘Wesley, Reading, MA, 1995,
3. http: //Linwww. ira.uka.de/bibliography/Database
1.4. REFERENCES FOR CHAPTER 1 3
4. M. M. Astrahan et al., “System R: a relational approach to database
management,” ACM Trans. on Database Systems 1:2, pp. 97-137, 1976.
hetp://www. informatik.uni-trier.de/“ley/db/index.btal . A mir~
ror site is found at http: //uww.acn.org/signod/dblp/db/ index.html
M. Stonebraker and J. M. Hellerstein (eds.), Readings in Database Sys-
tems, Morgan-Kaufmann, San Francisco, 1998.
M, Stonebraker, E. Wong, P. Kreps, and G. Held, “The design and imple-
mentation of INGRES,” ACM Trans. on Database Systems 1:3, pp. 189-
222, 1976.
8. J.D. Ullman, Principles of Database and Knowledge-Base Systems, Vol-
umes I and I, Computer Science Press, New York, 1988, 1989.Part I
Relational Database
ModelingChapter 2
The Relational Model of
Data
‘This chapter introduces the most important model of data: the two-dimensional
table, or “relation.” We begin with an overview of data models in general, We
nology for relations and show how the model can be used to
represent typical forms of data. We then introduce a portion of the language
SQL — that part ns and their structure. The chapter
closes with an int algebra. We see how this notation
serves as both a query language — the aspect of a data model that enables us
to ask questions about the data — and as a constraint language — the aspect
of a data model that lets us restrict the data in the database in vatious ways.
2.1 An Overview of Data Models
jon of a “data model” is one of the most fundamental in the study of
systems. In this brief summary of the concept, we define some basic
terminology and mention the most important data models.
2.1.1 What is a Data Model?
A data model is a notation for describing data or information. The description
generally consists of throe parts:
1, Structure of the data, You may be familiar with tools in programming
languages such as C or Java for describing the structure of the data used by
‘program: arrays and structures (“structs”) or objects, for example. The
data structures used to implement data in the computer are sometimes
referred to, in discussions of database systems, as a physical data model,
although in fact they are far removed from the gates and electzons that
truly serve as the physical implementation of the data, In the database
ra8 CHAPTER 2. THE RELATIONAL MODEL OF DATA
world, data models are at a somewhat higher Ie
fand are sometimes referred to as a conceptual
difference in level. We shall see examples shortly.
2. Operations on the de
data are generally anyt!
‘models, there is usually limi
We are generally allowed to perform
that retzieve information) and modifications (op
database). This limi
operati
ata very high level, yet have
the operations efficiently. In cor
‘optimize programs in con
Inefficient algorithm (e
(eg, quicksort.
the data. Database data models us
2.1.2 Important Data Models
‘Today, the two data models of preeminent importance for database systems are
1, The relational model, including object-relational extensions.
including XML and related standards,
2. The semistructured-data model
‘The first, which Is present in all com
is the subject of this chapter. The semistructured model
the primary manifestation, is an added feature of most
appears in a number of other contexts as well. We t
starting in Chapter 11
2.1.8. The Relational Model in Brief
minutes, and the genre of the movie. We show three particular
should imagine that there are many more rows to this table — one row for each
movie ever made, perhaps.
‘The structure portion of the relational model might appear to resemble an
array of structs in C, where the column headers are the field names, and each
2.1, AN OVERVIEW OF DATA MODELS 19
tithe |_vear | length | genre
Gone With the Wind] 1939] 231 | drane
Star Vars 397 | 124 | sciFi
Wayne’s World 1992| 98 | conedy
Figure 2.1: An example relation
of the rows represent the values of one struct in the array. However, it must be
‘emphasized that this physical implementation is only one possible way the table
could be implemented in physical data structures. In fact, it is not the normal
‘way to represent relations, and a large portion of the study of database systems
addresses the right ways to implement such tables. Much of the distinction
comes from the scale of relations — they are not normally implemented as
main-memory structures, and their proper physical implementation must take
into account the need to access relations of very large size that are resident on
disk,
‘The operations normally associated with the relational model form the “re-
lational algebra,” which we discuss beginning in Section 2.4. These operations
are table-oriented. As an example, we can ask for all those rows of a relatio
that have a certain value in a certain column. For example, we can ask of the
table in Fig. 2.1 for all the rows where the genre is “comedy.
‘The constraint portion of the relational data model will be touched upon
briefiy in Section 2.5 and covered in more detail in Chapter 7. However, as a
brief sample of what kinds of constraints are generally used, we could decide
that there is a fixed list of genres for movies, and that the last column of every
row must have a value that is on this list. Or we might decide (incorrectly,
it turns out) that there could never be two movies with the same title, and
constrain the table so that no two rows could have the same string in the fist
‘component.
2.1.4 The Semistructured Model in Brief
Semistructured data resembles trees or graphs, rather than tables or arrays.
The principal manifestation of this viewpoint today is XML, a way to represent
ally nested tagged elements. The tags, similar to those used
might appear in an XML “document” as i
‘The operations on semistructured data usually involve following paths in
the implied tree from an element to one or more of its nested subelements, then
to subelements nested those, and so on, For example, starting at the
the entire document in Fig. 2.2), we might move to
tag Movie? and
ing tag, and from each element to its nested
outer element0 CHAPTER 2, THE RELATIONAL MODEL OF DATA
Movie:
‘title="Gone With the Wind">
>1939
sngth>231
Movie titles"Star Wars">
1977
124
aciFi
Movie title="Vayne’s World">
‘1992
95
conedy
Figure 2.2: Movie data as XML
element, to see which movies belong to the “comedy” genre.
Constraints on the structure of data in this model often involve the data
type of values associated with a tag, For instance, are the values associated
with the tag integers or can they be arbitrary
Fig. 2.2 might be used within
xe be more than one gente for a movie? These and
in Seetion 11.2.
other matters will be taken
2.1.5 Other Data Models
‘There are many other models that are, or have been, associated with DBMS's.
‘A modern trend is to add object-oriented features to the relational model. There
are two effects of object-orientation on relations:
1, Values can have structure, rather than being elementary types such as
integer or strings, as they were in Fig, 2.1
2. Relations can have associated methods.
In a sense, these extensions, called the object-relational mode
the way structs in C were extended to objects in C++, We she
‘object-relational model in Section 10.3.
nalogous to
22, BASICS OF THE RELATIONAL MODEL a
are even database models of the purely object-oriented kind. In these,
longer the principal data-structuring concept, but becomes
‘only one option among many structures, We discuss an object-oriented database
‘model in Section 4.9.
‘There are several other mor
but that have now
is that were used in some ofthe earlier DBMS's,
ality of graphs was built directly into the network model, rather t
trees as these other models do.
2.1.6 Comparison of Modeling Approaches
Even from our brief exat sppears that semistructured models have more
flexibility than relations. This difference becomes even more apparent when
e productivity of programmers who use the data. Surprisin
goals can be achieved with a model, particularly the relational m
1. Provides a simple, limited approach to structuring data, yet is reasonably
versatile, so anything can be modeled.
2. Provides a limited, yet useful,
on of operations on data,
‘Together, these limitations turn into features. They allow us to impleme
languages, such , that enable the programmer to express their wishes at
a very high level. A few lines of SQL can do the work of thousands of lines of
, or hundreds of lines of the eode that had to be written to access dat
earlier models such as network or hierarchical. Yet the short SQL ps
because they use a strongly limited sets of operations, can be optimized to run
as fast, or faster than the code written in alternative languages,
2.2 Basics of the Relational Model
jonal model gives us a single
le called a relation. Figure 2.
y to represent data: as a two-dimen-
which we copy here as Fig. 2, is an
Movies. The rows each represent a2 CHAPTER 2. THE RELATIONAL MODEL OF DATA
‘movie, and the columns each represent a property of movies. In this section,
‘we shall introduce the most important terminology régarding relations, and
illustrate them with the Movies relation.
length | genre
Gone With the Wind | 1989] 281 | arama
Star Wars 1977 | 124 | scaré
Wayne’s World 1992| 95 | comedy
Figure 2.3: The relation Movies
2.2.1 Attributes
‘The columns ofa relation are named
year, length, and genre. AU
ally, an attribute describes the me:
instance, the column with attribute length holds t
each movie.
2.2.2 Schemas
‘The name of a relation and the set of attributes for a relat
schema for that relation. We show the schema for the relation
name followed by a patenthesized list ofits attributes. Thus, the schema for
relation Movies of Fig. 2.3 is
Wovies(title, year, length, genre)
‘The attributes in a relation schema are a set, not a list. However, in order to
talk about relations we often must specify a “standard” order for the attributes.
‘Thus, whenever we introduce a relation schema with a list of attributes, as
above, we shall take this ordering to be the standard order whenever we display
the relation or any ofits rows.
In the relational model, a database consists of one or more relations. The
set of schemas for the relations of a database is called a relational database
‘scherna, or just a database schema.
2.2.3 Tuples
‘The rows of a relation, other than the header row containing the attribute
names, are called ‘A tuple has one component for each attribute of
‘the relation. For instance, the first of the three tuples in Fig. 2.3 has the
four components Gone With the Wind, 1939, 231, and drama for attributes
‘title, year, length, and genre, respectively. When we wish to write a tuple
22, BASICS OF THE RELATIONAL MODEL 2
Conventions for Relations and Attributes
We shall generally follow the convention that relat
capital letter, and att
names begin with a
names begin with a lower-case letter. However,
talk of relations in the abstract, where the names
In that case, we shall use single capital letters
for both relations and attributes, e.g., R(A,B,C) for a generic rel
with three attributes.
in isolation, not as part of a relation, we normally use commas to separate
components, and we use parentheses to surround the tuple, For example,
(Gone With the Wind, 1999, 291, draza)
2.3, Notice that when a tuple appears in isolation, the
10 some indication of the relation to which the tuple
We shall always use the order in which the attributes
wore listed in the relation schema,
2.2.4 Domains
‘The relational model requires that each component of each
‘must be of some elementary type such as integer or
ted for a value to be a record structure, set, list, arra
It is further assumed that. as
domain, that isa partes elementary type. The components of any tuple of
the relation must have, in each ecmpone1
the corresponding column. For ex
schema, We shall do so by
es. For example, we could represe
iding a colon and a
‘schema for the Movies
Novies(title:string, year:integer, length: integer, genre:string)
2.2.5 Equivalent Representations of a Relation
jons are sets of tuples,
tuples of a relation are
three tuples of Fig. 2.3 in
“the same” as Fig, 2.3.
six possible orders, and the4 CHAPTER 2. THE RELATIONAL MODEL OF DATA
Moreover, we can reorder the attributes of the relation as we choose, without
changing the relation. However, when we reorder the relation schema, we must
be careful to remember that the attributes are column headers. Thus, when we
change the order of the attributes, we also change the order of their columns,
When the columns move, the components of tuples change their order as well.
‘The result is that each tuple has its components permuted in the same way as
the attributes are permuted.
For example, Fig, 2.4 shows one
‘ered “the same.” More precisely, these two tables are different presentations of
the same relation,
year
1977 | sciFi | Star Ware 124
1992 | comedy | Wayne 95
1939 | drama | Gone With the Wind | 231
Figure 2.4: Another presentation of the relation Novies
2.2.6 Relation Instances
A relation abo
movies is not static; rather, relations change over time. We
uples for new movies, as these appear. We also expect changes
to existing tuples if we get revised or corrected information about a movie, and
perhaps deletion of tuples for movies that are expelled from the database for
some reason
js less common for the schema of a relation to change. However, there are
situations where we might want to add or delete attributes. Schema changes,
while possible in commercial database systems, can be very expensive, because
cach of perhaps millions of tuples needs to be rewritten to add or delete com-
ponents. Also, if we add an attribute, it may be difficult or even impossible to
appropriate values for the new component in the existing tuples.
e shall calla set of tuples for a given relation an instance of that relation
For example, the three tuples shown in Fig, 2.3 form an instance of relation
Movies. Presumably, the relation Movies has changed over time and will con-
true to change over time. For instance, in 1990, Movies did not contain the
‘tuple for Vayne’s World. However, a conventional database system maintains
aly one version of any relation: the set of tuples that are in the relation “now.”
‘This instance of the relation is called the current instance.)
“Databases chat maintain historical versions of data at it exited in past times are called
temporal databases
2.2. BASICS OF THE RELATIONAL MODEL 2%
2.2.7 Keys of Relations
‘There are many constraints on relations that the relational model allows us to
place on database schemas. We shall defer much of the discussion of constraints
until Chapter 7. However, one kind of constraint is so fundamental that we shall
if we do not allow two tuples in a relation instance to have the same values
all the attributes of the key.
We can declare that the
Example 2,
of the two at
jon Movies has a key consisting
that title by itself does not form a key, since sometimes “remakes” of a movi
appear. For example, there are three movies named King Ke
‘a different year. It should also be obvious that year by its
‘here are usually many movies made in the same year.
dicate the attribute or attributes that form a key for a relation by
if the Key attribute(s), For instance, the Hevies relation could have
its schema written as
Movi
(eitle, year, length, genre)
Remember that the statement that a set of attributes forms a key for a
relation is a statement about all possible instances of the relation, not a state-
‘ment about a single instance, For example, looking only at the tiny relation of
Fig. 2.3, we might imagine that genre by itself forms a key, since we do not see
two tuples that agree on the value oftheir genre components, Hoxever, we can
imagine that if the relat
be many dramas,
tuples that agreed ‘a consequence, it would be
Incorrect to assert that genre is a key for the relation Movies.
While we might be sure that title and year can serve as a key for Novies,
id databaces use artificial keys, doubting that it is safe to make
about the values of attributes outside
from all others,
the employee-ID at
en if there are several employees with the same name. Thus,
te can serve as a key for a relation about employees,
0 serve as a key for employees. Note that there is
key, as there would be for
and Social-Security numbers.
‘hose purpose is to serve as a key is quite
w employve ID's, we find student ID's to distinguish
widespread. In ad26 CHAPTER 2. THE RELATIONAL MODEL OF DATA
students in a university. We find drivers’ license numbers and automobile reg-
istration numbers to distinguish drivers and automobiles, respectively. You
‘undoubtedly can find more examples of attributes created forthe primary pure
pose of serving as keys
Movies
genre:string,
studiotane:string,
producerC#: integer
>
Moviestar
nane:string,
address:etring,
gender:char,
direndave:aate
>
Starsin(
cort#: integer,
netorth: integer
>
Studio(
hage:string,
aadre
presC#: integer
)
Figure 2.5: Example database schema about movies
2.2.8 An Example Database Schema
‘We shall close this seetion with an example of a te database schema,
‘The topic is movies, and it builds on the relation Movies that has appeared so
shown in Fig. 2.5. Here are the things
of this schema,
2.2, BASICS OF THE RELATIONAL MODEL 7
Movies
This relation is an extension of the example rel
so far. Remember that its key is title and year together. We have added
two new attributes; studiolame tells us the studio that owns the movie, and
producerC# is an integer that represents the producer of the movie in a way
that we shall discuss when we talk about the relation MovieExec below.
we have been discussing
MovieStar
‘This relation tells us something about stars. The key is name, the name of the
movie star. It is not usual to assume names of persons are unique and therefore
suitable as a key. However, movie stars are different; one would never take a
name that some other movie star had used. Thus, we shall use the convenient.
tional approach would
-security numbers, so that
wwe could assign each individual a unique number and use that attribute as the
key. We take that a se. Another
interesting point abo two new data
types. The gender can be a si
date,” which might be a character string ofa special frm.
Starein
‘This relation connects movies to the stars of that movie, and likewise connects a
star to the movies in which they appeared. Notice that movies are represented
by the key for Movies — the title and year — although we have chosen differ-
are necessary to form a key. It is pet
Starsin could have two distinct tuples that agree in any two of the three
ibutes. For instance, a star might appear in two movies in one year, giving
rise to two tuples that agreed in movieYear and starNane, but disagreed in
novieTitle.
MovieExec
This relation tells us about movie executives. It contains their name, address,
and networth as data about the
‘and studio presidents (as app
low). These are integers; a different one is assigned to each exec28 CHAPTER 2. THE RELATIONAL MODEL OF DATA
acctNo | type __| balance
12345 | savings | 12000
23486 | checking | 1000
34867 ings | 25
The relation Accounts
frstName | tastName account
Robbie | Banks 901-209 12945
Lena | Hana | 905-299 | 12365,
tena | ana | 905-293 | 23456
‘The relation Customers
Figure 2.6: Two relations of a banking database
Studio
‘This relation tells about movie st
same name, and therefore use ni
address of the stu
‘We assume that the st
appears in NovieExec.
ios. We rely on no two studios having the
as the key. ‘The other at are the
he certificate number for the president of the studio,
president is surely a movie executive and therefore
2.2.9 Exercises for Section 2.2
Exercise 2.2.1
part of a ban|
In Fig. 2.6 are instances:
database. In:
8) The aetributes ofeach relation
by The tuples of each reation
¢) The components of one tuple fom each relation,
4) The relation schema foreach relation,
©) The database schema
£) A suitable domain for each attribute.
48) Another equivalent way to present each relation
23. DEFINING A RELATION SCHEMA IN SQL 29
Exercise 2.2.2: In Section 2.2.7 we suggested that there are many examples
of attributes that are created for the purpose of serving as keys of relations.
Give some additional examples.
Exercise 2.2.3: How many different wi
attributes) are there to represent a relation
considering orders of tuples and
tance if that instance has:
8) Three attributes and three tuples, lke the relation Accounts of Fig. 2.67
) Four attributes and five tuples?
©) mattributes and m tuples?
2.3 Defining a Relation Schema in SQL
‘SQL (pronounced “sequel”) is the principal language used to describe and ma-
te relational databases, There is a current standard for SQL, called SQL-
fost commercial database management systems implement something sim-
jut. not identical to, the standard. There are two aspects to SQL:
1. The Data-Definition sublanguage for declaring database schemas and
2. The Data-Manipulation sublanguage for querying (askiug questions a
out) databases and for modifying the database.
‘The distinction between these two sublangu
eg., C or Java have portions that declar
‘executable code, These correspond to data
respectively.
i section we shall begin a discussion of the data-definition portion
of SQL. There is more on
constraints on date, ‘The
Chapter 6.
languages;
mas that are
anipulation,
2.3.1 Relations in SQL
SQL makes a distinction between three kinds of relations:
1, Stored relations, which are called tables. These are the kind of relation
‘we deal with ordinarily — a relation that exists in the database and that
can be modified by changing its tuples, as well as queried.
Views, which are relations defined by a computation. These relations are
not stored, but are constructed, in whole or in part, when needed. They
are the subject of Section 8.1,30 CHAPTER 2. THE RELATIONAL MODEL OF DATA
‘3, Temporary tables, which are constructed by the SQL language processor
when it performs its job of executing queries and data modifications,
‘These relations are then thrown away and not stored.
shall learn how to declare tables. We do not treat the dee-
n of views here, and temporary tables are never declared,
and their data types. Tt also allows
{or a relation. There are many other
including many forms of constraints
that can be declared, and the declaration of indezes (data structures that speed
‘up many operations on the table) but we shall leave those for the appropriate
time.
2.3.2 Data Types
jive data types that are supported by SQL
ist have a data type.
1, Character strings of fixed or varying length, The type CHAR(n) denotes
a fixed-length string of up to n characters. VARCHAR(A) also denotes a
string of up to n characters. The difference is implementation-dependent;
used. SQL
8 reasonable coercions between values of characterstring types.
Normally, a string is padded by trail
of a component that is
ample, the string "foo"
of type CHAR(
llowing the second o).
varying-length chats
rer than characters
while BIT VARYING(a) de
‘The type BOOLEAN denotes an
gical. The possi-
are TRUE, FALSE, and — although it would
— ONKNOY.
permitted may be less, depending on
the types int and ehort int in C),
2.3, DEFINING A RELATION SCHEMA IN SQL 31
Dates and Times in SQL
Different SQL implementations may
tions for dates and times, but the if is the SQL standard repre-
sentation. A date value is the keyword DATE followed by a quoted string
of a special form. For example, DATE ° 1948-05-14? follows the required
form. The first four characters are digits representing the year. Then come
a hyphen and two digits representing the month. Finally there is another
hyphen and two digits representing the day. Note that single-digit months
and days are padded with a leading 0.
AA time value is the keyword TIME and a quoted string. This string has
n the military (24-hour) clock. Then come a color
two digits for the minute, a ind two digits for the second, If
fractions ofa second are desired, we may continue with a decimal point and.
‘as many significant digits as
represents the time at which all students
many different representa:
5. Floating-point numbers can be represented in a vari
use the type FLOAT or REAL (these are synonyms) for
decimal point assumed to be d positior
is a possible value of type DECTHAL(6
for DECIMAL, although there are possibie implementation-dependent dif
ferences,
6. Dates and times can be represented by the data types DATE and TIME,
2.3.3 Simple Table Declarations
‘The simplest form of declaration of a relation schema consists of the
words CREATE TABLE followed by the rame of th mn and a parent
attribute names and their types.
of
Example 2.2: The relati
declared as in Fig. 27. TI2 CHAPTER 2, THE RELATIONAL MODEL OF DATA
CREATE TABLE Movii
title
yoar
length INT,
genre ‘CHAR (10),
atudiollame CHAR(30),
producerct INT
Figure 2.7: SQL declaration of the table Movies:
‘The year and length attributes are each integers, and the genre is a string of
(up éo) 10 characters, The decision to allow up to 100 characters for a title
is arbitrary, but we don't want to limit the lengths of titles too strongly, or
long titles would be truncated to fit. We have assumed that 10 characters are
enough to represent a genre of movie; again, that is an arbitrary choice, one
wwe could regret if we had a genre with a long name, Likewise, we have chosen
30 characters as sufficient for the studio name. The certificate number for the
produeer of the movie is another integer. 0.
Example 2.8: Figure 2.8 is # SQL declaration ofthe relation NovieStar from
Fig. 2.5, It illustrates some new options for data types. The name of this table
is MovieStar, and it has four attributes. The first two attributes, nane and
address, have each been declared to be character strings. However, with the
name, we have made the decision to use a fixed-length string of 30 characters,
padding a name out with blanks at the end if necessary and truncating a name
to 30 characters if itis longer. In contrast, we have declared addresses to be
variable-length character strings of up to 255 characters It is not lear that
these two choices are the best possible, but we use them to illustrate the two
major kinds of string data types.
CREATE TABLE MovieStar (
‘CHAR(3O) ,
VARCHAR(255) ,
gender CHAR(1),
birthdate DATE
d
Figure 2.8: Declaring the relation schema for the NovieStar relation
strings, however.
28. DEPINING A RELATION SCHEMA IN SQL 33
‘The gender attribute has values that are a single letter, M or F. Thus, we
ccan safely use a single character as the type of this attribute. Finally, the
birthdate attribute naturally deserves the data type DATE.
2.3.4 Modifying Relation Schemas
table, including all of its current
change the schema by adding or deleting attributes.
We can delete a relation R by the SQL statement:
DROP TABLE R;
Relation Ris no longer part of the database schema, and we can no longer
‘access any of its tuples.
‘More frequently than we would drop a relation that is part of a long-lived
database, we may need to modify the schema of an existing relation. These
‘modifications are done by a statement that begins with the keywords ALTER
TABLE and the name of the relation. We then have several options, the most
important of which are
1. ADD followed by an attribute name and its data type.
2. DROP followed by an attribute name.
Example 2.4: Thus, for instance, we could modify the MovieStar relation by
adding an attribute phone with:
ALTER TABLE MovieStar ADD phone CHAR(16) ;
As a result, the MovieStar schema now has five attributes: the four mentioned
in Fig. 2.8 and the attribute phone, which is a fixed-length string of 16 bytes.
In the actual relation, tuples would all have components for phone, but we
know of no phone numbers to put there. Thus, the value of
to the special null value, NULL. In Section 2.
hhow itis possible to choose another “default” value to be used instead of NULL
for unknown values.
As another example, the ALTER TABLE statement:
ALTER TABLE MovieStar DROP birthdate;
deletes the birthdate attribute. As a result, the schema for ovieStar no
longer has that attribute, and all tuples of the current MovieStar instance
hhave the component for birthdate deleted, 2M4 CHAPTER 2. THE RELATIONAL MODEL OF DATA
2.3.5 Default Values
‘When we create or modify tuples, we sometimes do not have values for all
components. For instance, we mentioned in Example 2.4 that when we add a
in to 8 relation schema, the existing tuples do not have a known value, and
it was suggested that NULL could be used in place of a “real” value. However,
there are times when we would prefer to use another choice of default value, the
value that appears in a column if no other value is known,
In general, any place we declare an attribute and its data type, we may add
the keyword DEFAULT and an appropriate value. That value is either NULL or
iat are provided by the system, such as the
et us consider Example 2.3. We might wish to use the char-
acter ? as the default for an unknown gender, and we might also wish to use
the earliest possible date, DATE 1000-00-00" for an unknown birthdate, We
could replace the declarations of gender and birthdate in Fig. 2.8 by:
gender CHAR(1) DEFAULT 17”,
birthdate DATE DEFAULT DATE 0000-00-00"
As another example, we could have declared the default value for new at-
tribute phone to be "unlisted? when we added this attribute in Example 2.4
Tin that case,
ALTER TABLE MovieStar ADD phone CHAR(16) DEFAULT ‘unlisted’;
would be the appropriate ALTER TABLE statement,
2.3.6 Declaring Keys
‘There are two ways to declare an attribute or set of attributes to be @ key in
the CREATE TABLE statement that defines a stored relation.
1. We may declare one attribute to be a key when that attribute is listed in
the relation schema,
2, We may add to the in the schema (which so far
have only been attributes) an additional declaration that says a particular
attribute or set of attributes forms the key.
If the key consists of more than one attribute, we have to use method (2). If
the key is a single attribute, either method may be used.
‘There are two declarations that may be used to indicate keyness:
18) PRIMARY KEY, or
by owzgue,
23. DEFINING A RELAT
IN SCHEMA IN SQL 35
‘The effect of declaring a set of attributes S to be a Key for relation R either
using PRIMARY KEY or UNIQUE is the following:
n R cannot agree on
of them is HULL. Any att
this rule causes the DBM:
IEPRIMARY KEY is used,
value for their compot
ted by the system. NULL Is permitted ifthe set S is declared UNIQUE.
however. A DBMS make make other distinctions between
wishes.
other duplicate values for this attribute,
CREATE TABLE MovieStar (
‘name CHAR(SO) PRIMARY KET,
address VARCIAN(256)
gender CHAR(.) ,
birthdate DATE
Figure 2.0 \g nane the key
vely, we can use a separate definition of the key. ‘The resulting
would look like Fig. 2.10. Again, UNIQUE could replace
PRIMARY KEY, Ol
CREATE TABLE MovieStar (
‘name CHAR(30) ,
address VARCHAR (255),
‘gender CHAR(1)
birthdate DATE
PRIMARY KEY (name)
%
Figure 2.10: A separate declaration of the key36 CHAPTER 2. THE RELATIONAL MODEL OF DATA
Example 2.7: In Example 2.6, the form of either Fig. 2.9 or Fig. 2.10 is
acceptable, because the key is a single attribute. However, in a situation where
the key has more than one attribute, we must use the style of Fig. 2.10. For
instance, the relation Movie, whose key isthe pair of attributes title and year,
must be declared as in Fig. 2.11, However, as usual, UNIQUE is an option to
replace PRIMARY KEY. 0
CREATE TABLE Movies (
title cuAR(100) ,
year nt,
length INT,
genre CHAR(10),
studiolane CHAR(30),
producerc# INT,
PRIMARY KEY (title, year)
Figure 2.11: Making title and year be the key of Movies
Exercises for Section 2.3
2.8.1: In this exercise wo introduce one of our running examples of
tabase schema. The database schema consists of four relations,
‘whose schemas are
Product (maker, model, type)
PO(model, speed, ram, hd, price)
Laptop(nodel, speed, ram, hd, screen, price)
Printer (model, color, type, price)
The Product
laptop, or print
imbers are uniq
is not realistic, and a real
as part of the model number. The
that is a PC the speed (of the processo
megabytes), the size of the hard dish
relation is si iches) is also included. The
Printer relation records for each printer model whether the printer produces
color output (true, if $0), the process type (laser or ink-jet, typically), and the
price.
ves the manufacturer, model number and type (PC,
wus products. We assume for convenience that model
manufacturers and product types;
the price. The Laptop
the following declarations:
4) A suitable schema for relation Product.
2.3. DEFINING A RELATION SCHEMA IN SQL a7
») A suitable schema for relation Pc.
©) A suitable schema for relation Laptop.
4) A suitable schema for relation Printer.
) An alteration to your Printer schema from (4) to delete the attribute
color.
f) An alteration to your Laptop schema from (c) to add the attribute od
(optical-disk type, eg., ed or dvd). Let the default value for this attribute
bbe ‘none? if the laptop does not have an optical disk
Exercise 2.3.
World War II capital ships.
‘roduces another running example, concerning.
lves the following relations:
Classes(class, type, country, nunGuns, bore, displacement)
Ships(name, class, launched)
date)
‘ship, battle, result)
‘Ships ate built in “classes” from the same design, and the class js usually named
of that class. The records the name of the
yb? for battleship or 7
‘main guns, and the displacement
1e name of the ship, the name
was launched. Relation Battles gives the m:
8, and relation Outcones gives the
each battle.
ing declarations:
ble schema for relation Classes.
bb) A suitable schema for relation Ships.
©) A suitable schema for relation Battles.
4) A suitable schema for relation Outcomes.
) An alteration to your Clas:
bore.
relation from (a) to delete the attribute
) An alteration to your Ships relation from (b) to include the attribute
‘yard giving the shipyard where the ship was built.38 CHAPTER 2. THE RELATIONAL MODEL OF DATA
2.4 An Algebraic Query Language
In this section, we introduce the data- lation aspect of the relational
icture; it needs a way to query
f operations on relations,
algebra, that
jons from given
‘When the given relations are stored data, then the constructed relations ean be
answers to queries about this data.
Relational algebra is not used today as a query language in commercial
DBMS's, although some of the early prototypes did use this algebra directly
Rather, the “real” query language, SQL, incorporates relational algebra at its
center, and many SQL programs are really “syntactically sugared” expressions
of relational algebra. Further. when a DBMS processes queries, the first thing
that happens to a SQL query is that it gets translated into relational algebra.
or a very similar internal representation. ‘Thus, there are several good reasons
to start out learning this algebra,
2.4.1 Why Do We Need a Special Query Language?
Before introducing the operations of relational algebra, one should ask why, or
‘whether, we need a new kind of programming languages for databases, Won't
conventional languages like C or Java suffice to ask and answer any computable
about relations? After all, we can represent a tuple of a relation by @
in C) oF an object (in Java), and we can represent relations by arrays
of these elements
‘The surprising answer is that relational algebra is useful because it is less
powerful than C or Java. That is, there are computation:
odd. By limiting what we can say or do in our query language, we get two huge
rewards — ease of programming and the abil
highly optimized code — that we discussed
2.4.2 What is an Algebra?
general, consists of operators and atomic operands. For in-
algebra of arithmetic, the atomic operands are variables
15. The operators are the usual arithmetic ones: addition,
and division. Any algebra allows us to build
perators to atomic operands and/or other expresio
of the algebra. Usually, parentheses are needed to group operators and theit
operands, For instance, in arithmetic we have expressions such as (2 + y) +2 of
((2+7)/(y-3)) +2.
24, AN ALGEBRAIC QUERY LANGUAGE 39
Relational algebra is another example of an algebra. Its atomic operands
1. Variables that stand for relations.
2. Constants, which are finite relations.
We shall next see the operators of relational algebra,
2.4.3 Overview of Relational Algebra
‘The operations of the traditional relational algebra fall into four broad classes:
4a) The usual set operations — union, intersection, and difference ~~ applied
to relations.
b) Operations that remove parts of a relation: “selection” eliminates some
rows (tuples), and “projection” eliminates some columns,
luding “Cartesian
possible ways, and
ly pair tuples from two
©) Operations that combine the tuples of two rel
product,” which pairs the tuples of two rel
various kinds of “joii” operations, which select
relations.
4) Am operation called “renaming” that does not affect the tuples of a re-
lation, but changes the relation schema, ie., the names of the attributes
and/or the name of the relation itself.
We generally shall refer to expressions of relational algebra as queries.
2.4.4 Set Operations on Relations
‘The three most common operations on sets are union, intersection, and differ
‘ence. We assume the reader is familiar with these operations, which are defined
as follows on arbitrary sets R and 5:
‘* RUS, the union of R and S. is the set of elements that are in Ror $ or
both. An element appears only once in the union even present in
both Rand S.
© RNS, the intersection of R and S, is the set of elements that are in both
Rand S.
© R~S, the difference of R and S, is the set of elements that are in R but
not in S. Note that ~ 5 is different from 5 — R; the latter is the set of
elements that are in S$ but not in R.
When we apply these operations to relations, we need to put some conditions
on Rand S:40 CHAPTER 2, THE RELATIONAL MODEL OF DATA
1. Rand S must have schemas with identical sets of and the
types (domains) for each attribute must be the same in R and S.
2. Before we compute the set-theoretic union, intersection, or difference of
sets of tuples, the columns of R and S must be ordered so that the order
of attributes is the same for both relations.
Sometimes we woul
relations that have the same
but that use different na
‘operator to be discusse:
relations and give them
with corresponding domains,
tes. If $0, we may use the renaming
to change the schema of one or both
attributes.
gender | birthdate
F 9/9/99
Carrie Fisher | 123 vaple St., Hollywood
Mark Hamill | 456 Oak Rd., Brentwood |m | o/@/ee
Relation R
name address | gender | birthdate
Carrie Fisher | 123 Maple St., Hollywood [F [9/9/09
Harrison Ford | 789 Palm Dr., Beverly Hilts | | 7/7/77
Relation $
Figure 2.12: Two relations
‘Example 2.8: Suppose we have the two relations R and S, whose schemas
are both that of relation MovieStar Section 2.2.8. Current instances of Rand
S are shown in Fig. 2.12. Then the union RU
gender | birthdate
Carrie Fisher | 123 Maple St., Hollywood | F 9/9/99
Mark Haaill | 456 Oak Ra., Brentwood " 3/8/88
Harrison Ford | 789 Paim Dr., Beverly Hille | wrt
Note that the two tuples for Carrie Fisher from the two relations appear of
he result
‘The intersection RA S is
name address gender | birthdate
723 Waple St., Hollywood | F
ly the Carrie Fisher tuple appears, because only itis in both relations.
ifference R— Sis
24, AN ALGEBRAIC QUERY LANGUAGE a
gender | birthdate
Wark Haniii | 456 Oak Rd, Brentwood | M 3787/88
Rand thus are candidates for
Sand so is not in R-S. 0
‘That is, the Fisher and Hamil
RS. However, the Fisher
tuples appeat
le also appes
2.4.5 Projection
‘The projection operator is used to produce from a relation R a new
that has only some of R's columns. The value of expression r,,43
a relation that has only the columns for attributes A,,Az,... dn of R. The
schema for the resulting value is the set of attributes (4), 42,.-- An}, which
wwe conventionally show in the order listed
title | wear | tength | genre
Sear Ware 1977 | 198 | eciF
Galaxy Quest | 1999 | s08
Wayne's Worle | 1992 | 95
srtengtn Hovis)
‘The resulting relat
year _{ length
Galary Quest | 1999 | 104
Wayne's World | 1992 | 95
As another example, we can project onto the attribute genre with the ex-
pression tyenre(Movies). The result is the single-column relation
genre
aciFi
comedy
"Notice that there are only two tuples in the resulting relator
tuples of Fig. 2.13 have the same value in th
and in the relational algebra of set, duplicate2 CHAPTER 2. THE RELATIONAL MODEL OF DATA
A Note About Data Quality
While we have endeavored to make example data as accurate as possible,
‘we have used bogus values for addresses and other personal information
about movie stars, in order to protect the privacy of members of the acting
profession, many of whom are shy individuals who shun publicity
2.4.6 Selection
‘The selection operator, applied to a relation R, produces a new relation with a
subset of R's tuples. The tuples in the resulting relation are those that satisfy
some condition C that involves the attributes of R. We denote this operation
relation is the same as R's schema, and
1es in the same order as we use for R.
ional expression of the type we are familiar from
onal programming languag¢ onal expressions
eyword if in programming languag as C or Java. Th
js that the operands in condition C are
of R. We apply C to each tuple ¢ of R by subst
appearing in condition C, the component of t fo
tuting for each attribute of C the condition C
that appear
Example 2.10: Let the
tioName | producerO#
seiFi | Fox 12545
| conedy | Dreaavorks | 67890
‘The frst tuple satisfies the condition length > 100 because when we substitute
for length the value 124 found in the component of the fist tuple for attribute
Length, the condition becomes 124 > 100. The latter condition is true, so we
accept the first tuple. The same argument explains why the second tuple of
Fig. 2.13 is in the
The Length component 95. Thus, when we substitute for
length we get the condition 95 > 100, which is false. Hence the last tuple of
Fig. 2.13 is not in the result.
Example 2.11: Suppose we want the set of tuples in the relation Movies that
represent Fox movies at
‘2 more complicated cor
expression is
tength>100 AND studioNamen "Fox? (MOViCS)
24, AN ALGEBRAIC QUERY LANGUAGE 43
‘The tuple
title year | length | inColor | studioName | producerC#
‘Sear ware | 1977 | 124 | true | Fox 12345
is the only one in the resulting relation,
2.4.7 Cartesian Product
‘The Cartesian product (or cross-product, or just praduet) of two sets Rand
S is the set of pairs that can be formed by choosing the first element of the
pair to be any element of Rand the second any element of S. This product
Is denoted Rx S. When R and $ are relations, the product is essentially the
saine. However, since the members of R and $ are tuples, usually consisting
of more than one componé
from $ is a longer tuple,
‘constituent tuples. By convent
precede the components from Sin order for the result.
‘The relation schema for the resulting relation is the union of the schemas
Rand § should happen to have some attributes in
‘common, then we need to invent new names for at least one of each pair of
‘To disambiguate an attribute A that is in the schemas of
use R.A for the attribute from R and S.A for the at
yonent for each of the components of
che components from FR (the left operand)
Example 2.12: For conciseness, let us use an abstract example that illustrates
the product ope relations R and S have the schemas and tuples
in Fig. 2.14 ‘Then the product R x § consists of the six
shown in Fig. 2.14(c). Note how we have paired each of the two tuples of
2.4.8 Natural Joins
More often than we want to take the product of two relations, we find a need to
join them by pairing only those tuples that match in some way. The simplest
Sort of match is the natural join of two relations R and S, denoted R ve S, in
which we pair only those tuples from R and $ that agree in whatever attributes
are common to the schemas of R and $. More preci Ais Any »An be
all the attributes that are in both the schema of R and the schema of S. Then
tuple r from R and a tuple s from S are successfully paired if and only if r
‘and s agree on each of the attributes Ay, Any... Ap
If the tuples r and # are successfully paired in the join Roe S, then the
result of the pairing is a tuple, called the joined tuple, with one component for
‘each of the attributes in the union of the schemas of and S$. The joined tuple“4 CHAPTER 2. THE RELATIONAL MODEL OF DATA
Alp
T]2
a4
(@) Relation R
Bic|p
2/5 6
a}7 fa
9 | 10 | 1
(b) Relation
RB|SB|C|D
rire
ai7 fe
1[2 |o Jrofas
ala |2 |s |e
ala ja |7 |e
3}4 [9 ||
(o) Result Rx S
Figure 2.14: Two relations and their Cartesian product
agrees with tuple r in each attribute in the schema of R, and it agrees with
sin each the schema of S. Since r and s are successfully paired,
ble to agree with both these tuples on the attri
‘The construction of the joined
However, the order of the attributes need not be that convenient; the a
of Rand $ can appear in any order.
Example 2.13: The natural join of the relations A and S$ from Fig. 21
and (b) is
mmmon to Rand S is B. Thus, to pair succes
n their B components. Ifo, the resulting
ponents for attributes A (from R), B (Grom ether R ot §),C'(
(from $)
ly, tuples
le has com-
S), and D
24. AN ALGEBRAIC QUERY LANGUAGE 45
R
s
joined wple
Figure 2.15: Joining tuples
In this example, the first tuple of R successfully pairs with only the first
tuple of ; they share the value 2 on their common attribute B. This pairing
yields the first tuple of the result: (1,2,5,6). ‘The second tuple of R pairs
successfully only with the second tuple of S, and the pairing yields (3,4,7,).
Note that the third tuple of $ does not pair with any tuple of F and thus has
no effect on the result of Rea S. A tuple that fails to pair with any tuple of
the other relation in a join is said to be a dangling tuple,
Example 2.14: The previous example does not illustrate all the possibi
inherent in the natural join operator. For example,
’h more than one tuple, and there was only one attri
‘wo relation schemas. In Fig, 2.16 we see two other relat
share two attributes between their schemas: B and C. We also show an instance
in which one tuple joins with several tuples,
For tuples to pair successfully, they must agree in both the B and C’ com-
ponents. Thus, the first tuple of U' joins with the first two tuples of V, while
the second and third tuples of U join with the third tuple of V. The result of
‘these four pairings is shown in Fig. 2.16(
2.4.9 Theta-Joins
cother basis. For that purpose, we have a related not
join. Historically, the “theta” refers to an arbitrary co
Tepresent by C rather than 8.
‘The notation for a theta-join of relations R and $ based on condition C is
Roag S. The result of this operation is constructed as follows:
1, Take the product of R and 5.
2, Select from the product only those tuples that satisfy the condition C.46 CHAPTER 2. THE RELATIONAL MODEL OF DATA
cory.
Tae
ee ala
(@) Relation U
lb
sa tof
ee ala
0
(b) Relation V
Bic
2/3
aia
7\a
7I\8
(©) Result Use ¥
Figure 2.16: Natural join of relations
‘As with the product operation, the schema for the result is the union of the
schemas of R and S, with “R." or “S8." prefixed to attributes if necessary to
indicate from which Schema the attribute came
Example 2.15: Consider the operation Us acp V, where U and V are the
relations from Fig. 2.16(a) and (b). We must consider all nine pairs of tuples,
, and see whether the A component from the U-tuple
is less than the D component of the V-tuple. The first tuple of U, with an A
hh each of the tuples from V. However, the
.ples, constructed from the five successful pairings. This relation
ig. 217. 0
Notice that the schema for the result in Fig. 2.17 consists of all six attributes,
with U and V prefixed to their respective occurrences of at
distinguish them. Thus, the theta-join contrasts with natus
latter common attributes are merge
24, AN ALGEBRAIC QUERY LANGUAGE a
UB LUC {VB IVC |D
2 |3 ]2 13 14
2 |3 |2 5
2 |3 |r 10
fee) Gaal, 10
7 Ja |r 10
Figure 2.17: Result of U > aco V
do so in the case of the natural join, since tuples don’t pair unless they agree in
‘heir common attributes. In the case of a theta-join, there is no guarantee that
‘compared attributes will agree in the result, si
with =
Example 2.16: Here is a theta-join on the same relations U and V that has
‘4 more complex condition:
Vedac a usa. V
‘That is, we require for successful pairing not only that the A component of the
U-tuple be less than the D component of the V-tuple, but that the two tuples
disagree on their respective B components. The tuple
A|UB|UC|VB|VC|D
TT? 13 |? 1s |1
is the only one to satisfy both conditions, so this relation is the result of the
theta-join above. 0
2.4.10 Combining Operations to Form Queries
If all we could do was to write single operations on one or two relations as
‘queries, then relational algebra would not be nearly as useful as it is. However,
relational algebra, lke all algebras, allows us to form expressions of arbitrary
complexity by applying operations to the result of other operations.
‘One can construct expressions of relational algebra
to subexpressions, using parentheses when necessary to
operands. It is
latter often are easier for us to read, although they are less convenient as a
machine-readable notation,
Suppose we want to know, from our running Movies relation,
itles and years of movies made by Fox that are at least 100
* One way to compute the answer to this query is
1. Sclect those Movies tuples that have length > 100.48 CHAPTER 2, THE RELATIONAL MODEL OF DATA
2. Select those Movies tuples that have studioName = ’Fox?
3. Compute the intersection of (1) and (2).
4. Project the relation from (3) onto attributes title and year.
i
2
engi >= 100 rox!
“T
Movies Movies
Figure 2.18: Expression tree for a relational algebra expression
In Fig. 2.18 we see the above steps represented as an expression tree. Ex:
pression trees are evaluated bottom-up by the operator at an interior
‘The two selection nodes correspond to step
corresponds to step (3), and the projection
Alternatively, we could represent the same express
linear notation, with parentheses. The formula
in a conventional,
Reateseor (2 (Hovies) M Syudionamen'Fox? (Hovies))
represents the same expression
Incidentally, ther is often more than one relational algebra expression that
represents the same computation. For instance, the above query could also be
written by replacing the intersection by logical AND within a single selection
aperation. That i,
1+ (Stengin2100 AND studioNamen"Fox? (Novies))
is an equivalent form of the query. 0
24. AN ALGEBRAIC QUERY LANGUAGE 49
Equivalent Expressions and Query Optimization
All database systems have a query-answering system, and many of them
are based on a language that is similar in expressive power to relational
algebra. Thus, the query asked by a user may have many equivalent e2-
pressions (expressions that produce the same answer whenever they are
given the same relations as operands), and some of these may be much
‘more quickly evaluated. An important job of the query “optimizer” dis-
‘cussed briefly in Section 1.2.5 is to replace one expression of relational
algebra by an equivalent expression that is more efficiently evaluated,
2.4.11 Naming and Renaming
is often convenient to
shall use the operator
oreover, the at-
ofthe result relation Sate named Ay, a,--. Any in order from the
og peer tnt nn ere rere are
attributes an they ae in Ry we can jst sey ps(P)
and (b) and used the convention that when an attribute
appears in both operands, it is renamed by prefixing the relation name to it.
Suppose, however, that we do not wish to call the two versions of B by names
RB and S.B; rather we want to continue to use the name B for the attribute
that comes from R, and we want to use X as the name of the attribute B
coming from S. We can rename the attributes of S so the fist is called X. The
is a relation named § that looks just like
first column has attribute X instead of B.
Aleixic |p
result of the expression ps
the relation $ from Fig. 2.
B
TT2]2
aj2]a
1}2]o9|1o}a
alal2
aja la
3|4]o [10] a
Figure 2.19: R x pscx,c,p)(S)50 CHAPTER 2. THE RELATIONAL MODEL OF DATA
When we take the product of R with this new relation, there is no conflict
the
of names among the attributes, so no further renaming is done. That
result of the expression Rx psix,c,p)(S) is the relation 2x S from Fig. 2
‘except that the five columns are labeled A, B, X, C, and D, from the let. This
relation is shown in Fig. 2.19.
‘As an altemnative, we could take the product without renaming, as we did
in Example 2.12, and then rename the result. The expression
prs(a.e.x.c.0)(R x $)
yields the same relation as in Fig. 2.19, with the same set of attributes. But
this relation has a name, RS, while the result relation in Fig. 2.19 has no name.
0
2.4.12 Relationships Among Operations
‘Some of the operations that we have described in Section 2.4 can be expressed
in terms of other relational-algebra operations. For example, intersection can
be expressed in terms of set difference:
RNS=R-(R-S)
‘That is, if R and $ are any two relations with the same schema, the intersection
of R and S can be computed by first subtracting S from R to form a relation
T consisting of all those tuples in R but not S. We then subtract T from R,
leaving only those tuples of F that are also in S.
The two forms of join are also expressible in terms of other operations.
‘Theta-join can be expressed by product and selection:
Roag $= a0(Rx S)
‘The natural join of R and 5 can be expressed by starting with the product
Fx S. We then apply the selection operator with a condition C of the form
RA = S.Ay AND R.Ap = S.Ap AND: --AND Ry = S.A
where Ay, A;
and S. Fins
An ate all the attributes appearing in the schemas of both R
/, we must project out one copy of each of the equated attributes
Rea =n1(2o(R x5)
Example 2.19: The natural join of the relations U and V from Fig. 2.16 can
be written in terms of product, selection, and projection as:
raveuco(sunev.s ax ucevclY x)
24, AN ALGEBRAIC QUERY LANGUAGE a
‘That i, we take the product Ux V. Then we select for equality between each
pair of attributes with the same name — B and C in this example. Finally,
wwe project onto all the attributes except one of the B’s and one of the C's; we
have chosen to climinate the attributes of V whose names also appear in the
schema of U.
For another example, the theta-join of Example 2.16 can be written
aco an u.pev.n(U xV)
is, we take the product of the relations U7 and V and then apply the
condition that appeared in the theta-join.
‘The rewriting rules mentioned in this section are the only “redundancies”
‘among the operations that remaining operations —
union, difference, selection, » product, and renaming — form an in-
dependent set, none of which can be written in terms of the other five.
2.4.13 A Linear Notation for Algebraic Expressions
In Section 2.4.10 we used an expression tree to represent @ complex expression
of relational algebra. An alternative is to invent names for the temporary
relations that correspond to the interior nodes of the tree and write a sequence
‘of assignments that create a value for each. The order of the assignments is
flexible, as long as the children of a node N have had their values created before
wwe attempt to create the value for NV itself.
‘The notation we shall use for assignment statements is:
1. A relation name and parenthesized list of attributes for that relation. The
‘name Answer will be used conv for the result of the final step;
ice, the name of the relation at the root of the expression tree.
‘The assignment symbol
cn the right. We can choose to use only one
in which case each interior node of the tree gets
is also permissible to combine
if it is convenient to do so.
Any algebraic express
‘operator per assignin
its own assignment statement. Howeve
several algebraic operations in one right side
Example 2.20: Consider the tree of Fig. 2.18. One possible sequence of as-
signments to evaluate this expression is:
= Ctength> 100 (Movies)52 CHAPTER 2. THE RELATIONAL MODEL OF DATA
step computes the relation of the interior node labeled ovengehi00 in
, and the second step computes the node labeled otudictamew For
iat we get renaming “for free,” since we can use any attributes and
relation name we wish for the left side of an assignment. The last two steps
compute the intersection and the projection in the obvious way.
tis also permissible to combine some of the steps. For instance, we could
‘combine the last two steps and write:
= Siength2100 (Movies)
sexs (Movies)
ns)
We could even substitute for R and S in the last line and write the entire
expression in one line.
2.4.14 Exercises for Section 2.4
Exercise 2.4.1: This exercise builds upon the products schema of Exercise
2.3.1. Recall that the database schema consists of four relations, whose schemas
Product (maker, model, type)
speed, ran, hd, price)
‘speed, ram, hd, screen, price)
Printer (model, color, type, price)
Some sample data for the relation Product is shown in Fig. 2.20. Sample
data for the other three relations is shown in Fig. 2.21. Manufacturers and
‘model numbers have been “sanitized,” but the data is typical of products on
sale at the beginning of 2007.
‘Write expressions of relational algebra to answer the following queries. You
‘may use the linear notation of Section 2.4.13 if you wish. For the data of Figs.
2.20 and 2.21, show the result of your query. However, your answer should work
for arbitrary data, not just the data of these figures.
a) What PC models have a speed of at least 3.007
) Which manufacturers make laptops with a hard disk of at least 10GB?
©) Find the model number and price ofall products (of any type) made by
manufacturer B.
4) Find the model numbers of all color laser printers.
¢) Find those manufacturers that sell Laptops, but not PC's.
1!) Find those hard-disk sizes that occur in two or more PC's.
4. AN ALGEBRAIC QUERY LANGUAGE 53
maker | model | type
1001 | pe
1002 | pe
1003 | pe
2004 | Laptop
2005 | Laptop
2006 | laptop
1004 | pe
1005 | pe
1008 | pe
2007 | laptop
1007 | pe
1008 | pe
1009 | pe
1010 | pe
3004 | printer
3005 | printer
1011 | pe
pe
1013 | pe
2001 | laptop
2002 | laptop
2003 | laptop
3001 | printer
3002 | printer
3003 | printer
2008 | laptop
2009 | laptop
2010 | 1aptop
3006
3007
Figure 2.20: Sample data for Product