Introduction to Database Systems Concepts
member of a team, or to work individually, in the areas of database application
development or administration. In addition, some coverage of current research
areas is provided, partly as a stimulus for possible future dissertation topics,
and also to provide an awareness of possible future developments within the
database arena.
Module objectives
At the end of this module you will have acquired practical and theoretical knowl-
edge and skills relating to modern database systems. The module is designed
so that this knowledge will be applicable across a wide variety of database en-
vironments. At the end of the module you will be able to:
• Understand and explain the key ideas underlying database systems and
the database approach to information storage and manipulation.
• Design and implement database applications.
• Carry out actions to improve the performance of existing database appli-
cations.
• Understand the issues involved in providing multiple users concurrent ac-
cess to database systems.
• Be able to design adequate backup, recovery and security measures for
a database installation, and understand the facilities provided by typical
database systems to support these tasks.
• Understand the types of tasks involved in database administration and
the facilities provided in a typical database system to support these tasks.
• Be able to describe the issues and objectives in a range of areas of contem-
porary database research.
Chapter objectives
Introduction
In parallel with this chapter, you should read Chapter 1 and Chapter 2 of
Thomas Connolly and Carolyn Begg, “Database Systems: A Practical Approach
to Design, Implementation, and Management”, (5th edn.).
This chapter sets the scene for all of the forthcoming chapters of the module.
We begin by examining the approach to storing and processing data that was
used before the arrival of database systems, and that is still appropriate today
in certain situations (which will be explained). We then go on to examine the
difference between this traditional, file-based approach to data storage, and that
of the database approach. We do this first by examining inherent limitations of
the file-based approach, and then discuss ways in which the database approach
can be used to overcome these limitations.
A particular model of database systems, known as the Relational model, has
been the dominant approach in the database industry since the early ’80s. There
are now important rivals and extensions to the Relational model, which will be
examined in later chapters, but the Relational model remains the core technol-
ogy on which the database industry worldwide is based, and for this reason this
model will be central to the entire module.
If there are a small number of records to be kept, and these do not need to be
changed very often, a card index might be all that is required. However, where
there is a high volume of data, and a need to manipulate this data on a regular
basis, a computer-based solution will often be chosen. This might sound like a
simple solution, but there are a number of different approaches that could be
taken.
The term ‘file-based approach’ refers to the situation where data is stored in one
or more separate computer files defined and managed by different application
programs. Typically, for example, the details of customers may be stored in
one file, orders in another, etc. Computer programs access the stored files to
perform the various tasks required by the business. Each program, or some-
times a related set of programs, is called a computer application. For example,
all of the programs associated with processing customers’ orders are referred to
as the order processing application. The file-based approach might have appli-
cation programs that deal with purchase orders, invoices, sales and marketing,
suppliers, customers, employees, and so on.
Limitations
• Data duplication: Each program stores its own separate files. If the same
data is to be accessed by different programs, then each program must store
its own copy of the same data.
• Data inconsistency: If the data is kept in different files, there could be
problems when an item of data needs updating, as it will need to be
updated in all the relevant files; if this is not done, the data will be incon-
sistent, and this could lead to errors.
• Difficult to implement data security: Data is stored in different files by
different application programs. This makes it difficult and expensive to
implement organisation-wide security procedures on the data.
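The duplication and inconsistency problems above can be made concrete with a small sketch (the file contents and record layout are purely illustrative): two applications each hold their own copy of a customer's address, and an update applied to only one copy leaves the data inconsistent.

```python
# A minimal sketch of the file-based approach: the order-processing and
# invoicing applications each keep their own copy of customer data.
# (Customer details and file contents are hypothetical.)
import csv
import io

# Each application defines and manages its own customer file.
orders_customers = io.StringIO()
invoicing_customers = io.StringIO()
for f in (orders_customers, invoicing_customers):
    csv.writer(f).writerow(["C001", "Smith Ltd", "12 High Street"])

# The customer moves; only the order-processing application is updated.
orders_customers.seek(0)
orders_customers.truncate()
csv.writer(orders_customers).writerow(["C001", "Smith Ltd", "7 Mill Lane"])

def address(buf):
    """Read the address field from the first record of a customer file."""
    buf.seek(0)
    return next(csv.reader(buf))[2]

# The two copies of the 'same' data now disagree: data inconsistency.
print(address(orders_customers))     # 7 Mill Lane
print(address(invoicing_customers))  # 12 High Street
```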
The following diagram shows how different applications will each have their own
copy of the files they need in order to carry out the activities for which they are
responsible:
The shared file approach
One approach to solving the problem of each application having its own set
of files is to share files between different applications. This will alleviate the
problem of duplication and inconsistent data between different applications, and
is illustrated in the diagram below:
The introduction of shared files solves the problem of duplication and inconsis-
tent data across different versions of the same file held by different departments,
but other problems may emerge, including:
• File incompatibility: When each department had its own version of a file
for processing, each department could ensure that the structure of the
file suited their specific application. If departments have to share files,
the file structure that suits one department might not suit another. For
example, data might need to be sorted in a different sequence for different
applications (for instance, customer details could be stored in alphabetical
order, or numerical order, or ascending or descending order of customer
number).
• Difficult to control access: Some applications may require access to more
data than others; for instance, a credit control application will need access
to customer credit limit information, whereas a delivery note printing
application will only need access to customer name and address details.
The file will still need to contain the additional information to support
the application that requires it.
• Physical data dependence: If the structure of the data file needs to be
changed in some way (for example, to reflect a change in currency), this
alteration will need to be reflected in all application programs that use
that data file. This problem is known as physical data dependence, and
will be examined in more detail later in the chapter.
• Difficult to implement concurrency: While a data file is being processed
by one application, the file will not be available for other applications or
for ad hoc queries. This is because, if more than one application is allowed
to alter data in a file at one time, serious problems can arise in ensuring
that the updates made by each application do not clash with one another.
This issue of ensuring consistent, concurrent updating of information is an
extremely important one, and is dealt with in detail for database systems
in the chapter on concurrency control. File-based systems avoid these
problems by not allowing more than one application to access a file at one
time.
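The physical data dependence problem described above can be sketched as follows (the record layout and field names are hypothetical): each application that reads a fixed-width customer file must embed knowledge of the physical layout, so a change to the file structure forces amendments to every such program.

```python
# A sketch of physical data dependence with a fixed-width record file.
# The layout (field offsets) is hypothetical.
OLD_LAYOUT = {"code": (0, 5), "name": (5, 25), "balance": (25, 33)}

def read_balance(record, layout):
    """Extract the balance field using a hard-coded physical layout."""
    start, end = layout["balance"]
    return float(record[start:end])

name = "Smith Ltd".ljust(20)            # fixed-width name field
old_record = "C0001" + name + "00123.45"
print(read_balance(old_record, OLD_LAYOUT))   # 123.45

# The file structure changes: a 3-character currency code is inserted
# before the balance. Programs using OLD_LAYOUT now read the wrong bytes;
# every application must be amended by hand to use the new offsets.
new_record = "C0001" + name + "GBP" + "00123.45"
NEW_LAYOUT = {"code": (0, 5), "name": (5, 25), "balance": (28, 36)}
print(read_balance(new_record, NEW_LAYOUT))   # correct again, but only
# after each application's layout table has been updated
```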
Review question 1
What is meant by the file-based approach to storing data? Describe some of the
disadvantages of this approach.
Review question 2
How can some of the problems of the file-based approach to data storage be
avoided?
Review question 3
What are the problems that remain with the shared file approach?
The database approach is an improvement on the shared file solution as the use
of a database management system (DBMS) provides facilities for querying, data
security and integrity, and allows simultaneous access to data by a number of
different users. At this point we should explain some important terminology:
• Database: A database is a collection of related data.
• Database management system: The term ‘database management sys-
tem’, often abbreviated to DBMS, refers to a software system used to
create and manage databases. The software of such systems is complex,
consisting of a number of different components, which are described later
in this chapter. The term ‘database system’ is often used as an alternative
term for ‘database management system’.
• System catalogue/Data dictionary: The description of the data held in
the database, maintained by the database management system.
• Database application: Database application refers to a program, or
related set of programs, which use the database management system to
perform the computer-related tasks of a particular business function, such
as order processing.
One of the benefits of the database approach is that the problem of physical
data dependence is resolved; this means that the underlying structure of a data
file can be changed without the application programs needing amendment. This
is achieved by a hierarchy of levels of data specification. Each such specification
of data in a database system is called a schema. The different levels of schema
provided in database systems are described below. Further details of what is
included within each specific schema are discussed later in the chapter.
The Systems Planning and Requirements Committee of the American National
Standards Institute encapsulated the concept of schema in its three-level
database architecture model, known as the ANSI/SPARC architecture, which
is shown in the diagram below:
ANSI/SPARC three-level architecture
user applications. The external schema maps onto the conceptual schema, which
is described below.
There may be many external schemas, each reflecting a simplified model of the
world, as seen by particular applications. External schemas may be modified, or
new ones created, without the need to make alterations to the physical storage
of data. The interface between the external schema and the conceptual schema
can be amended to accommodate any such changes.
The external schema allows the application programs to see as much of the
data as they require, while excluding other items that are not relevant to that
application. In this way, the external schema provides a view of the data that
corresponds to the nature of each task.
The external schema is more than a subset of the conceptual schema. While
items in the external schema must be derivable from the conceptual schema,
this could be a complicated process, involving computation and other activities.
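One way this derivation can work in practice is sketched below using SQLite (the table, column and view names are illustrative): the external schema is defined as a view involving computation, not just a subset of the stored columns.

```python
# A sketch of an external schema as a view over the conceptual schema.
# Note the view is not merely a subset: order_value is computed.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_no INTEGER, qty INTEGER, unit_price REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 3, 2.50), (2, 10, 1.20)])

# External schema for an invoicing application: order number and value only.
con.execute("""CREATE VIEW invoice_view AS
               SELECT order_no, qty * unit_price AS order_value
               FROM orders""")

print(con.execute("SELECT * FROM invoice_view").fetchall())
# [(1, 7.5), (2, 12.0)]
```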
itself as physical. This may contrast with the perspective of a systems program-
mer, who may consider data files as logical in concept, but their implementation
on magnetic disks in cylinders, tracks and sectors as physical.
Components of a DBMS
DBMS engine
The engine is the central component of a DBMS. This component provides access
to the database and coordinates all of the functional elements of the DBMS. An
important source of data for the DBMS engine, and the database system as a
whole, is known as metadata. Metadata means data about data. Metadata is
contained in a part of the DBMS called the data dictionary (described below),
and is a key source of information to guide the processes of the DBMS engine.
The DBMS engine receives logical requests for data (and metadata) from human
users and from applications, determines the secondary storage location (i.e. the
disk address of the requested data), and issues physical input/output requests
to the computer operating system. The data requested is fetched from physical
storage into computer main memory; it is contained in special data structures
provided by the DBMS. While the data remains in memory, it is managed by the
DBMS engine. Additional data structures are created by the database system
itself, or by users of the system, in order to provide rapid access to data being
processed by the system. These data structures include indexes to speed up
access to the data, buffer areas into which particular types of data are retrieved,
lists of free space, etc. The management of these additional data structures is
also carried out by the DBMS engine.
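As a small illustration of metadata, SQLite keeps the description of every table and index in a catalogue table that can be queried with ordinary SQL (the table and index names below are illustrative):

```python
# 'Data about data': SQLite records the schema of every table, and of
# every index built to speed up access, in its catalogue table
# sqlite_master, queried with the same SQL used for ordinary data.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (cust_no INTEGER PRIMARY KEY, name TEXT)")
con.execute("CREATE INDEX idx_name ON customer (name)")

for kind, name in con.execute("SELECT type, name FROM sqlite_master ORDER BY name"):
    print(kind, name)
# table customer  /  index idx_name (alphabetical order of names)
```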
which will be triggered by the actions of users as they use the forms-based
user interface.
• A DBMS procedural programming language, often based on standard
third-generation programming languages such as C and COBOL, which
allows programmers to develop sophisticated applications.
• Fourth-generation languages, such as Smalltalk, JavaScript, etc. These
permit applications to be developed relatively quickly compared to the
procedural languages mentioned above.
• A natural language user interface that allows users to present requests in
free-form English statements.
Data integrity management subsystem
The data integrity management subsystem provides facilities for managing the
integrity of data in the database and the integrity of metadata in the dictionary.
This subsystem is concerned with ensuring that data is, as far as software can
ensure, correct and consistent. There are three important functions:
• Intra-record integrity: Enforcing constraints on data item values and types
within each record in the database.
• Referential integrity: Enforcing the validity of references between records
in the database.
• Concurrency control: Ensuring the validity of database updates when
multiple users access the database (discussed in a later chapter).
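The first two of these functions can be sketched in SQLite (the schema is illustrative; note that SQLite only enforces referential integrity when the foreign_keys pragma is enabled):

```python
# Intra-record integrity via a CHECK constraint, and referential
# integrity via a FOREIGN KEY. Table and column names are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # referential checks are opt-in in SQLite
con.execute("""CREATE TABLE product (
                   prod_no INTEGER PRIMARY KEY,
                   price   REAL CHECK (price >= 0))""")
con.execute("""CREATE TABLE orders (
                   order_no INTEGER PRIMARY KEY,
                   prod_no  INTEGER REFERENCES product (prod_no))""")

con.execute("INSERT INTO product VALUES (1, 9.99)")

try:
    con.execute("INSERT INTO product VALUES (2, -5.00)")   # bad item value
except sqlite3.IntegrityError as e:
    print("rejected:", e)

try:
    con.execute("INSERT INTO orders VALUES (100, 999)")    # no such product
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```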
The second of the three environments is often called pre-production. Applica-
tions that have been tested in the development environment will be moved into
pre-production for volume testing; that is, testing with quantities of data that
are typical of the application when it is in live operation.
The final environment is known as the production or live environment. Appli-
cations should only be moved into this environment when they have been fully
tested in pre-production. Security is nearly always a very important issue in the
production environment, as the data being used reflects important information
in current use by the organisation.
Each of these separate environments will have at least one database system,
and because of the widely varying activities and security measures required in
each environment, the volume of data and degree of administration required will
itself vary considerably between environments, with the production database(s)
requiring by far the most support.
Given the need for the database administrator to migrate both programs and
data between these environments, an important tool in performing this pro-
cess will be a set of utilities or programs for migrating applications and their
associated data both forwards and backwards between the environments in use.
• Better modelling of real-world data: Databases are based on semantically
rich data models that allow the accurate representation of real-world in-
formation.
• Uniform security and integrity controls: Security control ensures that ap-
plications can only access the data they are required to access. Integrity
control ensures that the database represents what it purports to represent.
• Economy of scale: Concentration of processing, control personnel and tech-
nical expertise.
Organisations need data to provide details of the current state of affairs; for
example, the amount of product items in stock, customer orders, staff details,
office and warehouse space, etc. Raw data can then be processed to enable
decisions to be made and actions to be taken. Data is therefore an important
resource that needs to be safeguarded. Organisations will therefore have rules,
standards, policies and procedures for data handling to ensure that accuracy is
maintained and that proper and appropriate use is made of the data. It is for
this reason that organisations may employ data administrators and database
administrators.
It is important that the data administrator is aware of any issues that may af-
fect the handling and use of data within the organisation. Data administration
includes the responsibility for determining and publicising policy and standards
for data naming and data definition conventions, access permissions and restric-
tions for data and processing of data, and security issues.
The data administrator needs to be a skilled manager, able to implement policy
and make strategic decisions concerning the organisation’s data resource. It is
not sufficient for the data administrator to propose a set of rules and regulations
for the use of data within an organisation; the role also requires the investigation
of ways in which the organisation can extract the maximum benefit from the
available data.
One of the problems facing the data administrator is that data may exist in
a range of different formats, such as plain text, formatted documents, tables,
charts, photographs, spreadsheets, graphics, diagrams, multimedia (including
video, animated graphics and audio), plans, etc. In cases where the data is avail-
able on computer-readable media, consideration needs to be given to whether
the data is in the correct format.
The problem of differing data formats is further complicated by the range of
terms used to describe data within the organisation. One problem is
the use of synonyms, where a single item of data may be known by a number
of different names. An example of the use of synonyms would be the terms
‘telephone number’, ‘telephone extension’, ‘direct line’, ‘contact number’ or just
‘number’ to mean the organisation’s internal telephone number for a particular
member of staff. In an example such as this, it is easy to see that the terms
refer to the same item of data, but it might not be so clear in other contexts.
A further complication is the existence of homonyms. A homonym is a term
which may be used for several different items in different contexts; this can
often happen when acronyms are used. One example is the use of the terms
‘communication’ and ‘networking’; these terms are sometimes used to refer to
interpersonal skills, but may also be employed in the context of data communi-
cation and computer networks.
When the items of data that are important to an organisation have been iden-
tified, it is important to ensure that there is a standard representation format.
It might be acceptable to tell a colleague within the organisation that your tele-
phone extension is 5264, but this would be insufficient information for someone
outside the organisation. It may be necessary to include full details, such as
international access code, national code, area code and local code as well as
the telephone extension to ensure that the telephone contact details are usable
worldwide.
Dates are a typical example of an item of data with a wide variety of formats.
The ranges of date formats include: day-month-year, month-day-year, year-
month-day, etc. The month may appear as a value in the range 1 to 12, as the
name of the month in full, or a three-letter abbreviation. These formats can be
varied by changing the separating character between fields from a hyphen (-) to
a slash (/), full stop (.) or space ( ).
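A sketch of how such variation might be handled in practice, parsing several of the formats listed above and standardising on a single representation (the list of accepted formats is an assumption, and forms such as 01-02-2023 remain inherently ambiguous between day-month and month-day conventions unless one convention is fixed):

```python
# Standardising a variety of date formats into one representation
# (ISO 8601). The accepted format list is illustrative; real data may
# need a fixed convention to resolve day-month vs month-day ambiguity.
from datetime import datetime

FORMATS = ["%d-%m-%Y", "%m/%d/%Y", "%Y.%m.%d", "%d %b %Y", "%d %B %Y"]

def standardise(text):
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            pass  # wrong separator or field order for this format; try next
    raise ValueError(f"unrecognised date format: {text!r}")

for raw in ["25-12-2023", "12/25/2023", "2023.12.25", "25 Dec 2023"]:
    print(standardise(raw))  # each prints 2023-12-25
```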
The use of standardised names and formats will assist an organisation in making
good use of its data. The role of the data administrator involves the creation
of these standards and their publication (including the reasons for them and
guidelines for their use) across the organisation. Data administration provides
a service to the organisation, and it is important that it is perceived as such,
rather than as the introduction of unnecessary rules and regulations.
A number of different approaches or models have been developed for the logical
organisation of data within a database system. This ‘logical’ organisation must
be distinguished from the ‘physical’ organisation of data, which describes how
the data is stored on some suitable storage medium such as a disk. The physical
organisation of data will be dealt with in the chapter on physical storage. By
far the most commonly used approach to the logical organisation of data is the
Relational model. In this section we shall introduce the basic concepts of the
Relational model, and give examples of its use. Later in the module, we shall
make practical use of this knowledge in both using and developing examples of
Relational database applications.
It is often necessary to identify uniquely each of the different instances
of entities in a database. In order to do this we use something called
a primary key. We will discuss the nature of primary keys in detail in the next
learning chapter, but for now we shall use examples where the primary key is
the first of the attributes in each tuple of a relation.
Relation: Stationery
Here, the attributes are item-code, item-name, colour and price. The values for
each attribute for each item are shown as a single value in each column for a
particular row. Thus for item-code 20217, the values are ‘A4 paper 250 sheets’
for the item-name, ‘Blue’ for the colour, and 2.75 for the price.
Question: Which of the attributes in the stationery relation do you think would
make a suitable key, and why?
The schema defines the ‘shape’ or structure of a relation. It defines the number
of attributes, their names and domains. Column headings in a table represent
the schema. The extension is the set of tuples that comprise the relation at
any time. The extension (contents) of a relation may vary, but the schema
(structure) generally does not.
From the example above, the schema is represented as:
The extension from the above example is given as:
The extension will vary as rows are inserted or deleted from the table, or values
of attributes (e.g. price) change. The number of attributes will not change, as
this is determined by the schema. The number of rows in a relation is sometimes
referred to as its cardinality. The number of attributes is sometimes referred to
as the degree or grade of a relation.
Each relation needs to be declared, its attributes defined, a domain specified for
each attribute, and a primary key identified.
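The schema/extension distinction can be sketched as follows, using the Stationery relation above (the second and third tuples are invented for illustration):

```python
# The schema fixes the structure (and hence the degree); the extension
# is the contents, and the cardinality varies with it.
schema = ("item-code", "item-name", "colour", "price")          # structure
extension = {                                                   # contents
    ("20217", "A4 paper 250 sheets", "Blue", 2.75),
    ("20218", "A4 paper 250 sheets", "White", 2.75),            # invented
}

degree = len(schema)          # number of attributes: 4
cardinality = len(extension)  # number of tuples: 2

extension.add(("30114", "Stapler", "Black", 4.50))  # extension varies...
print(len(extension))         # 3
print(degree)                 # ...but the degree does not: still 4
```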
Review question 7
Distinguish between the terms ‘entity’ and ‘attribute’. Give some examples of
entities and attributes that might be stored in a hospital database.
Review question 8
The range of values that a column in a relational table may be assigned is called
the domain of that column. Many database systems provide the possibility of
specifying limits or constraints upon these values, and this is a very effective
way of screening out incorrect values from being stored in the system. It is
useful, therefore, when identifying which attributes or columns we wish to store
for an entity, to consider carefully what is the domain for each column, and
which values are permissible for that domain.
Consider then for the following attributes, what the corresponding domains are,
and whether there are any restrictions we can identify which we might use to
validate the correctness of data values entered into attributes with each domain:
• Attribute: EMPLOYEE_NAME
• Attribute: JOB (i.e. the job held by an individual in an organisation)
• Attribute: DATE_OF_BIRTH
Discussion topic
External schemas can be used to give individual users, or groups of users, access
to a part of the data in a database. Many systems also allow the format of the
data to be changed for presentation in the external schema, or for calculations to
be carried out on it to make it more usable to the users of the external schema.
Discuss the possible uses of external schemas, and the sorts of calculations
and/or reformatting that might be used to make the data more usable to specific
users or user groups.
External schemas might be used to provide a degree of security in the database,
by making available to users only that part of the database that they require
in order to perform their jobs. So for example, an Order Clerk may be given
access to order information, while employees working in Human Resources may
be given access to the details of employees.
In order to improve the usability of an external schema, the data in it may be
summarised or organised into categories. For example, an external schema for a
Sales Manager, rather than containing details of individual sales, might contain
summarised details of sales over the last six months, perhaps organised into
categories such as geographical region. Furthermore, some systems provide the
ability to display data graphically, in which case it might be formatted as a bar,
line or pie chart for easier viewing.
Chapter 2. The Relational Model
Table of contents
• Objectives
• Introduction
• Context
• Structure of the Relational model
– Theoretical foundations
– Uniform representation of data
– Relation
– Attribute
– Domain
– Tuple
– Degree
– Cardinality
– Primary key
– Foreign keys
– Integrity constraints
∗ Nulls
∗ Entity integrity
∗ Referential integrity
∗ General constraints
• Data manipulation: The Relational Algebra
– Restrict
– Project
– Union
– Intersection
– Difference
– Cartesian product
– Division
– Join
– Activities
∗ Activity 1: Relational Algebra I
∗ Activity 2: Relational Algebra II
∗ Activity 3: Relational Algebra III
• Review questions
• Discussion topics
• Additional content and activities
Objectives
a simple but well-founded approach to the storage and manipulation of
data.
• Explain basic concepts of the Relational model, such as primary and for-
eign keys, domains, null values, and entity and referential integrity.
• Be able to discuss, in terms of business applications, the value of the above
concepts in helping to preserve the integrity of data across a range of
applications running on a corporate database system.
• Explain the operators used in Relational Algebra.
• Use Relational Algebra to express queries on Relational databases.
Introduction
In parallel with this chapter, you should read Chapter 3 and Chapter 4 of
Thomas Connolly and Carolyn Begg, “Database Systems: A Practical Approach
to Design, Implementation, and Management”, (5th edn.).
The aim of this chapter is to explain in detail the ideas underlying the Relational
model of database systems. This model, developed through the ’70s and ’80s,
has grown to be by far the most commonly used approach for the storing and
manipulation of data. Currently all of the major suppliers of database systems,
such as Oracle, IBM with DB2, Sybase, Informix, etc., base their products on
the Relational model. Two of the key reasons for this are as follows.
Firstly, there is a widely understood set of concepts concerning what constitutes
a Relational database system. Though some of the details of how these ideas
should be implemented continue to vary between different database systems,
there is sufficient consensus concerning what a Relational database should pro-
vide that a significant skill base has developed in the design and implementation
of Relational systems. This means that organisations employing Relational tech-
nology are able to draw on this skill-base, as well as on the considerable literature
and consultancy know-how available in Relational systems development.
The second reason for the widespread adoption of the Relational model is ro-
bustness. The core technology of most major Relational products has been tried
and tested over the last 12 or so years. It is a major commitment for an organi-
sation to entrust the integrity, availability and security of its data to software.
The fact that Relational systems have proved themselves to be reliable and se-
cure over a significant period of time reduces the risk an organisation faces in
committing what is often its most valuable asset, its data, to a specific software
environment.
Relational Algebra is a procedural language which is a part of the Relational
model. It was originally developed by Dr E. F. Codd as a means of accessing data
in Relational databases. It is independent of any specific Relational database
product or vendor, and is therefore useful as an unbiased measure of the power
of Relational languages. We shall see in a later chapter the further value of
Relational Algebra, in helping gain an understanding of how transactions are
processed internally within the database system.
Context
The Relational model underpins most of the major database systems in com-
mercial use today. As such, an understanding of the ideas described in this
chapter is fundamental to these systems. Most of the remaining chapters of
the module place a strong emphasis on the Relational approach, and even in
those that examine research issues that use a different approach, such as the
chapter on Object databases, an understanding of the Relational approach is
required in order to draw comparisons and comprehend what is different about
the new approach described. The material on Relational Algebra provides a
vendor-independent and standard approach to the manipulation of Relational
data. This information will have particular value when we move on to learn the
Structured Query Language (SQL), and will also assist in understanding how
database systems can reformulate queries from their original specification to
reduce their execution time, a topic covered partially in the chapter called Database
Administration and Tuning.
Theoretical foundations
Much of the theory underpinning the Relational model of data is derived from
mathematical set theory. The seminal work on the theory of Relational database
systems was developed by Dr E. F. Codd (Codd 1971). The theoretical development
of the model has continued to this day, but many of the core principles
were described in the papers of Codd and Date in the ’70s and ’80s.
The application of set theory to database systems provides a robust foundation
to both the structural and data-manipulation aspects of the Relational model.
The relations, or tables, of Relational databases are based on the concepts of
mathematical sets. Sets in mathematics contain members, and these correspond
to the rows in a relational table. The members of a set are unique, i.e. duplicates
are not allowed. Also the members of a set are not considered to have any
order; therefore, in Relational theory, the rows of a relation or table cannot be
assumed to be stored in any specific order (note that some database systems
allow this restriction to be overridden at the physical level, as in some situations,
for example to improve the performance response of the database, it can be
desirable to ensure the ordering of records in physical storage).
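A small illustration of these set-theoretic points, representing a relation as a set of tuples (the sample data is invented): duplicate members are absorbed, and membership, not position, is what the structure records.

```python
# A relation modelled as a set of tuples: members are unique and unordered.
employees = {
    ("E01", "Patel"),
    ("E02", "Jones"),
    ("E01", "Patel"),   # duplicate member: a set stores it only once
}

print(len(employees))                  # 2, not 3
print(("E01", "Patel") in employees)   # membership, not position: True
# There is no employees[0]: the tuples of a relation have no defined order.
```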
Uniform representation of data
We saw in the previous chapter that the Relational model uses one simple data
structure, the table, to represent information. The rows in tables correspond
to specific instances of records, for example, a row of a customer table contains
information about a particular customer. Columns in a table contain informa-
tion about a particular aspect of a record, for example, a column in a customer
record might contain a customer’s contact telephone number.
Much of the power and robustness of the Relational approach derives from the
use of simple tabular structures to represent data. To illustrate this, consider
the information that might typically be contained in part of the data dictionary
of a database. In a data dictionary, we will store the details of logical table
structures, the physical allocations of disk space to tables, security information,
etc. This data dictionary information will be stored in tables, in just the same
way that, for example, customer and other information relevant to the end users
of the system will be stored.
This consistency of data representation means that the same approach for data
querying and manipulation can be applied throughout the system.
Relation
A relation is a table with columns and rows (tuples). A database can contain as
many tables as the designer wants. Each table is an implementation of a real-world
entity. For example, the university keeps information about students. A student
is represented as an entity during the database design stage. When the design is
implemented, a student is represented as a table. You will learn about database
design in later modules.
Attribute
An attribute is a named column of a relation. For example, the column holding
students’ names in a student table is an attribute of that relation.
Domain
The domain of an attribute is the set of allowable values for that column. For
example, the name of a student might be made up of a maximum of 30 lower- and
upper-case characters. Any combination of lower- and upper-case characters of
30 characters or fewer is then the domain of the name column.
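A domain behaves like a validation rule on a column. As an illustrative sketch in Python (the helper name is hypothetical; the 30-character limit comes from the example above):

```python
def in_name_domain(value):
    """True if value satisfies the example domain: a string of
    1 to 30 lower- and upper-case letters."""
    return (isinstance(value, str)
            and 1 <= len(value) <= 30
            and value.isalpha())

print(in_name_domain("Moses"))    # a valid student name
print(in_name_domain("x" * 31))   # too long: outside the domain
```

A DBMS performs an equivalent check automatically whenever a value is inserted into or updated in the column.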
Tuple
A tuple is a row of a table. Each tuple represents an instance of an entity.
For example, a student table can contain a row holding information about Moses.
Moses is an instance of the student entity.
Degree
The degree of a relation is the number of attributes (columns) it contains.
Cardinality
The cardinality of a relation is the number of tuples (rows) it contains.
Primary key
In the previous chapter, we described the use of primary keys to identify each
of the rows of a table. The essential point to bear in mind when choosing a
primary key is that it must be guaranteed to be unique for each different row of
the table, and so the question you should always ask yourself is whether there
is any possibility that there could be duplicate values of the primary key under
consideration. If there is no natural candidate from the data items in the table
that may be used as the primary key, there is usually the option of using a
system-generated primary key. This will usually take the form of an ascending
sequence of numbers, a new number being allocated to each new instance of a
record as it is created. System-generated primary keys, such as this, are known
as surrogate keys. A drawback to the use of surrogate keys is that the unique
number generated by the system has no other meaning within the application,
other than serving as a unique identifier for a row in a table. Whenever the op-
tion exists therefore, it is better to choose a primary key from the available data
items in the rows of a table, rather than opting for an automatically generated
surrogate key.
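A surrogate key allocator can be sketched in Python as an ascending counter; the table and rows below are purely hypothetical:

```python
from itertools import count

# Each new record is given the next number in an ascending sequence.
next_key = count(start=1)

customers = {}
for name in ["Ann Jones", "Bob Smith", "Ann Jones"]:  # duplicate names allowed
    key = next(next_key)              # system-generated, meaningless number
    customers[key] = {"name": name}

print(sorted(customers))              # keys 1, 2, 3 -- unique by construction
```

The keys are unique even though two customers share a name, which is exactly why surrogate keys are safe; the trade-off, as noted above, is that the numbers carry no meaning for users.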
Primary keys may consist of a single column, or a combination of columns. An
example of a single table column would be the use of a unique employee number
in a table containing information about employees.
As an example of using two columns to form a primary key, imagine a table in
which we wish to store details of project tasks. Typical data items we might
store in the columns of such a task table might be: the name of the task, date
the task was started, expected completion date, actual completion date, and
the employee number of the person responsible for ensuring the completion of
the task. There is a convenient, shorthand representation for the description of
a table as given above: we write the name of the table, followed by the name of
the columns of the table in brackets, each column being separated by a comma.
For example:
TASK (TASK_NAME, START_DATE, EXPECTED_COMP_DATE,
COMP_DATE, EMPNO)
We shall use this convention for expressing the details of the columns of a table
in examples in this and later chapters. Choosing a primary key for the task table
is straightforward provided we can assume that task names are unique. If that is
the case, then we may simply use task name as the primary key. However, if we
decide to store in the task table, the details of tasks for a number of different
projects, it is less likely that we can still be sure that task names will be unique.
For example, supposing we are storing the details of two projects, the first to
buy a new database system, and the second to move a business to new premises.
For each of these projects, we might have a task called ‘Evaluate alternatives’. If
we wish to store the details of both of these tasks in the same task table, we can
now no longer use TASK_NAME as a unique primary key, as it is duplicated
across these two tasks.
As a solution to this problem, we can combine the TASK_NAME column with
something further to add the additional context required to provide a unique
identifier for each task. In this case, the most sensible choice is the project name.
So we will use the combination of the PROJECT_NAME and TASK_NAME
data items in our task table in order to identify uniquely each of the tasks in
the table. The task table becomes:
TASK (PROJECT_NAME, TASK_NAME, START_DATE, EXPECTED_COMP_DATE,
COMP_DATE, EMPNO)
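The effect of a composite primary key can be modelled in Python as a mapping keyed on a (project, task) pair: the two values together must be unique, so a duplicated task name alone is no longer a problem. The helper and rows below are illustrative only:

```python
tasks = {}

def insert_task(project_name, task_name, details):
    key = (project_name, task_name)          # composite primary key
    if key in tasks:
        raise ValueError(f"duplicate primary key: {key}")
    tasks[key] = details

# The same task name under two different projects is fine:
insert_task("BUY DATABASE", "Evaluate alternatives", {"empno": 101})
insert_task("MOVE PREMISES", "Evaluate alternatives", {"empno": 102})
```

Repeating an insert with the same project and task name would raise the duplicate-key error, mirroring a DBMS rejecting a primary key violation.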
We may, on occasions, choose to employ more than two columns as a primary
key in a table, though where possible this should be avoided as it is both un-
wieldy to describe, and leads to relatively complicated expressions when it comes
to querying or updating data in the database. Notice also that we might have
used a system-generated surrogate key as the solution to the problem of pro-
viding a primary key for tasks, but the combination of PROJECT_NAME and
TASK_NAME is a much more meaningful key to users of the application, and
is therefore to be preferred.
Foreign keys
Very often we wish to relate information stored in different tables. For example,
we may wish to link together the tasks stored in the task table described above,
with the details of the projects to which those tasks are related. The simplicity
by which this is achieved within the Relational model, is one of the model’s
major strengths. Suppose the Task and Project tables contain the following
attributes:
TASK (TASK_NAME, START_DATE, EXP_COMP_DATE, COMP_DATE,
EMPNO)
PROJECT (PROJECT_NAME, START_DATE, EXP_COMP_DATE,
COMP_DATE, PROJECT_LEADER)
We assume that TASK_NAME is an appropriate primary key for the TASK
table, and PROJECT_NAME is an appropriate primary key for the PROJECT
table.
In order to relate a record of a task in a task table to a record of a corresponding
project in a project table, we use a concept called a foreign key. A foreign key is
simply a piece of data that allows us to link two tables together. In the case of
the projects and tasks example, we will assume that each project is associated
with a number of tasks. To form the link between the two tables, we place the
primary key of the PROJECT table into the TASK table. The task table then
becomes:
TASK (TASK_NAME, START_DATE, EXP_COMP_DATE, COMP_DATE,
EMPNO, PROJECT_NAME)
Through the use of PROJECT_NAME as a foreign key, we are now able to see,
for any given task, the project to which it belongs. Specifically, the tasks associ-
ated with a particular project can now be identified simply by virtue of the fact
that they contain that project’s name as a data item as one of their attributes.
Thus, all tasks associated with a research project called GENOME RESEARCH,
will contain the value of GENOME RESEARCH in their PROJECT_NAME
attribute.
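The lookup described above can be sketched in Python: each task row carries the project's primary key as a foreign-key value, and the tasks for a project are found simply by matching on it (the sample rows are invented):

```python
# Each task row holds the project's primary key as a foreign key.
tasks = [
    {"task_name": "Sequence sample",       "project_name": "GENOME RESEARCH"},
    {"task_name": "Evaluate alternatives", "project_name": "BUY DATABASE"},
    {"task_name": "Analyse results",       "project_name": "GENOME RESEARCH"},
]

genome_tasks = [t["task_name"] for t in tasks
                if t["project_name"] == "GENOME RESEARCH"]
print(genome_tasks)
```

No special link structure is needed: the relationship is carried entirely by the data value.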
The beauty of this approach is that we are forming the link using a data item.
We are still able to maintain the tabular structure of the data in the database,
but can relate that data in whatever ways we choose. Prior to the development
of Relational systems, and still in many non-Relational systems today, rather
than using data items in this way to form the link between different entity types
within a database, special link items are used, which have to be created, altered
and removed. These activities are in addition to the natural insertions, updates
and deletions of the data itself. By using a uniform representation of data both
for the data values themselves, and the links between different entity types, we
achieve uniformity of expression of queries and updates on the data.
The need to link entity types in this way is a requirement of all, other than the
most trivial of, database applications. That it occurs so commonly, allied with
the simplicity of the mechanism for achieving it in Relational systems, has been
a major factor in the widespread adoption of Relational databases.
Below is a summary of the concepts we have covered so far:
Integrity constraints
Integrity constraints are rules that restrict the values that can be held in a
database; they must be satisfied by every instance of the database.
Nulls
There is a standard means of representing information that is not currently
known or unavailable within Relational database systems. We say that a column
for which the value is not currently known, or for which a value is not applicable,
is null. Null values have attracted a large amount of research within the database
community, and indeed for the developers and users of database systems they
can be an important consideration in the design and use of database applications
(as we shall see in the chapters on the SQL language).
An important point to grasp about null values is that they are a very specific way
of representing the fact that the data item in question literally is not currently
set to any value at all. Prior to the use of null values, and still in some systems
today, if it is desired to represent the fact that a data item is not currently set
to some value, an alternative value such as 0, or a blank space, will be given
to that data item. This is poor practice, as of course 0, or a blank space, are
perfectly legitimate values in their own right. Use of null values overcomes this
problem, in that null is a value whose meaning is simply that there is no value
currently allocated to the data item in question.
There are a number of situations in which the use of null values is appropriate.
In general we use it to indicate that a data item currently has no value allocated
to it. Examples of when this might happen are:
• When the value of the data item is not yet known.
• When the value for that data item is yet to be entered into the system.
• When it is not appropriate that this particular instance of the data item
is given a value.
An example of this last situation might be where we are recording the details
of employees in a table, including their salary and commission. We would store
the salaries of employees in one table column, and the details of commission in
another. Supposing that only certain employees, for example sales staff, are paid
commission. This would mean that all employees who are not sales staff would
have the value of their commission column set to null, indicating that they are
not paid commission. The use of null in this situation enables us to represent
the fact that some commissions are not set to any specific value, because it is
not appropriate to pay commission to these staff.
Another result of this characteristic of null values is that where two data items
both contain null, if you compare them with one another in a query language,
the system will not find them equal. Again, the logic behind this is that the
fact that each data item is null does not mean they are equal, it simply means
that they contain no value at all.
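This behaviour can be modelled explicitly. Note that Python's own None == None evaluates to True, so the SQL-style rule has to be written out by hand; the sketch below returns None to stand for the 'unknown' result of comparing nulls:

```python
NULL = None

def sql_eq(a, b):
    """SQL-style equality: the result is unknown (None) if either
    operand is null, otherwise True or False."""
    if a is NULL or b is NULL:
        return None          # unknown, not True
    return a == b

print(sql_eq(NULL, NULL))    # None: two nulls do not compare equal
print(sql_eq(500, 500))      # True
```

This three-valued logic (true, false, unknown) is the model used in the SQL chapters that follow.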
Entity integrity
As briefly discussed in Chapter 1, in Relational databases, we usually use each
table to store the details of particular entity types in a system. Therefore, we
may have a table for Customers, Orders, etc. We have also seen the importance
of primary keys in enabling us to distinguish between different instances of
entities that are stored in the different rows of a table.
Consider for the moment the possibility of having null values in primary keys.
What would be the consequences for the system?
Null values denote the fact that the data item is not currently set to any real
value. Imagine, however, that two rows in a table are the same, apart from the
fact that part of their primary keys are set to null. An attempt to test whether
these two entity instances are the same will find them not equal, but is this
really the case? What is really going on here is that the two entity instances
are the same, other than the fact that part of their primary keys is as yet
unknown. Therefore, the occurrence of nulls in primary keys would stop us
being able to compare entity instances. For this reason, the column or columns
used to form a primary key are not allowed to contain null values. This rule is
known as the Entity Integrity Rule, and is a part of the Relational theory that
underpins the Relational model of data. The rule does not ensure that primary
keys will be unique, but by not allowing null values to be included in primary
keys, it does avoid a major source of confusion and failure of primary keys.
Referential integrity
If a foreign key is present in a given table, it must either match a candidate
key value in the home (referenced) table or be set to null. For example, in our project and task
example above, every value of PROJECT_NAME in the task table must exist
as a value in the PROJECT_NAME column of the project table, or else it must
be set to null.
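The referential integrity rule can be sketched as a simple check in Python (table layouts and rows are hypothetical): every non-null foreign-key value in the task table must appear as a primary-key value in the project table.

```python
def referential_integrity_ok(tasks, projects):
    """Every non-null PROJECT_NAME in the task table must appear as a
    PROJECT_NAME in the project table (or be null)."""
    project_names = {p["project_name"] for p in projects}
    return all(t["project_name"] is None
               or t["project_name"] in project_names
               for t in tasks)

projects = [{"project_name": "GENOME RESEARCH"}]
tasks = [
    {"task_name": "Sequence sample", "project_name": "GENOME RESEARCH"},
    {"task_name": "Unassigned task", "project_name": None},  # null FK allowed
]

print(referential_integrity_ok(tasks, projects))   # True
```

A DBMS enforces this automatically, rejecting inserts and updates that would leave a dangling foreign-key reference.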
General constraints
These are additional rules specified by the users or database administrators of a
database, which define or constrain some aspect of the enterprise. For example,
a database administrator can constrain the PROJECT_NAME column to have
a maximum of 30 characters for each value inserted.
Restrict
Using the Relational Algebra, extract from the relation Singers those individuals
who are over 40 years of age, and create a new relation called Mature-Singers.
Relational Algebra operation: Restrict from Singers where age > 40 giving
Mature-Singers
We can see which tuples are chosen from the relation Singers, and these are
identified below:
The new relation, Mature-Singers, contains only those tuples for singers aged
over 40 extracted from the relation Singers. These were shown highlighted in
the relation above.
Note that Anne Freeman (age 40) is not shown. The query explicitly stated
those over 40, and therefore anyone aged exactly 40 is excluded.
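Restrict corresponds to filtering a set of tuples on a predicate. The sketch below uses invented sample data (the module shows the Singers relation only as a diagram) and illustrates why a singer aged exactly 40 is excluded by age > 40:

```python
# Hypothetical sample of the Singers relation: (singer_id, name, age).
singers = {
    (1, "Anne Freeman", 40),
    (2, "Helen Drummond", 45),
    (3, "Desmond Venables", 38),
}

# Restrict from Singers where age > 40 giving Mature-Singers
mature_singers = {t for t in singers if t[2] > 40}
print(sorted(mature_singers))   # Anne Freeman (exactly 40) is excluded
```

Changing the predicate to `t[2] >= 40` (or `t[2] > 39` for integer ages) would include Anne Freeman, matching the two alternative operations discussed below.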
If we wanted to include singers aged 40 and above, we could use either of the
following operations which would have the same effect.
Either:
Relational Algebra operation: Restrict from Singers where age >= 40 giving
Mature-Singers2
or
Relational Algebra operation: Restrict from Singers where age > 39 giving
Mature-Singers2
The result of either of these operations would be as shown below:
Project
Project is used on a single relation, and produces a new relation that includes
only those attributes requested from the original relation. This operation has
the effect of choosing columns from the table, with duplicate entries removed.
The task here is to extract the names of the singers, without their ages or any
other information as shown in the diagram below. Note that if there were two
singers with the same name, they would be distinguished from one another in
the relation Singers by having different singer-id numbers. If only the names are
extracted, the information that there are singers with the same name will not
be preserved in the new relation, as only one copy of the name would appear.
The column for the attribute singer-names is shown highlighted in the relation
above.
The following Relational Algebra operation creates a new relation with singer-
name as the only attribute.
Relational Algebra operation: Project Singer-name over Singers giving Singer-
names
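Project corresponds to selecting certain fields and discarding duplicate results; a Python set performs the duplicate removal automatically. The sample tuples below are invented:

```python
# (singer_id, name, age): two different singers share a name.
singers = {
    (1, "Helen Drummond", 45),
    (2, "Desmond Venables", 38),
    (3, "Helen Drummond", 29),
}

# Project Singer-name over Singers giving Singer-names
singer_names = {name for (_sid, name, _age) in singers}
print(sorted(singer_names))   # the two Helen Drummonds collapse to one name
```

As noted above, once the identifying singer-id is projected away, the fact that there were two singers with the same name is lost.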
Union
Union (also called ‘append’) forms a new relation of all tuples from two or more
existing relations, with any duplicate tuples deleted. The participating relations
must be union compatible.
It can be seen that the relations Singers and Actors have the same number of
attributes, and these attributes are of the same data types (the identification
numbers are numeric, the names are character fields, the address field is alphanu-
meric, and the ages are integer values). This means that the two relations are
union compatible. Note that it is not important whether the relations have the
same number of tuples.
The two relations Singers and Actors will be combined in order to produce a
new relation Performers, which will include details of all singers and all actors.
An individual who is a singer as well as an actor need only appear once in the
new relation. The Relational Algebra operation union (or append) will be used
in order to generate this new relation.
The union operation will unite these two union-compatible relations to form a
new relation containing details of all singers and actors.
Relational Algebra operation: Union Singers and Actors giving Performers
We could also express this activity in the following way:
Relational Algebra operation: Union Actors and Singers giving Performers
These two operations would generate the same result; the order in which the
participating relations are given is unimportant. When an operation has this
property it is known as commutative - other examples of this include addition
and multiplication in arithmetic. Note that this does not apply to all Relational
Algebra operations.
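Union maps directly onto the set union operator, and its commutativity can be checked directly (the sample tuples are invented; assume matching attributes, so the relations are union compatible):

```python
singers = {(1, "Helen Drummond"), (2, "Swee Hor Tan")}
actors  = {(1, "Helen Drummond"), (3, "Anne Freeman")}

# Union Singers and Actors giving Performers
performers = singers | actors
print(sorted(performers))        # Helen Drummond appears only once

# Union is commutative: the operand order does not matter.
assert (singers | actors) == (actors | singers)
```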
The new relation Performers contains one tuple for every tuple that was in the
relation Singers or the relation Actors; if a tuple appeared in both Singers and
Actors, it will appear only once in the new relation Performers. This is why
there is only one tuple in the relation Performers for each of the individuals
who are both actors and singers (Helen Drummond and Desmond Venables), as
they appear in both the original relations.
Intersection
Intersection creates a new relation containing tuples that are common to both
the existing relations. The participating relations must be union compatible.
As before, it can be seen that the relations Singers and Actors are union compat-
ible as they have the same number of attributes, and corresponding attributes
are of the same data type.
We can see that there are some tuples that are common to both relations, as
illustrated in the diagram below. It is these tuples that will form the new
relation as a result of the intersection operation.
The Relational Algebra operation intersection will extract the tuples common
to both relations, and use these tuples to create a new relation.
Relational Algebra operation: Actors Intersection Singers giving Actor-Singers
The intersection of Actors and Singers produces a new relation containing only
those tuples that occur in both of the original relations. If there are no tuples
that are common to both relations, the result of the intersection will be an
empty relation (i.e. there will be no tuples in the new relation).
Note that an empty relation is not the same as an error; it simply means that
there are no tuples in the relation. An error would occur if the two relations
were found not to be union compatible.
Difference
If we want to find out which actors are not also singers, the following Relational
Algebra operation will achieve this:
Relational Algebra operation: Actors difference Singers giving Only-Actors
The result of this operation is to remove from the relation Actors those tuples
that also occur in the relation Singers. In effect, we are removing the tuples in
the intersection of Actors and Singers in order to create a new relation that con-
tains only actors. The diagram below shows which tuples are in both relations.
The new relation Only-Actors contains tuples from the relation Actors only if
they were not also present in the relation Singers.
It can be seen that this operation produces a new relation Only-Actors contain-
ing all tuples in the relation Actors except those that also occur in the relation
Singers.
If it is necessary to find out which singers are not also actors, this can be done
by the relational operation ‘remove Actors from Singers’, or ‘Singers difference
Actors’. This operation would not produce the same result as ‘Actors difference
Singers’ because the relational operation difference is not commutative (the
participating relations cannot be expressed in reverse order and achieve the
same result).
This operation produces a new relation Only-Singers containing all tuples in the
relation Singers except those that also occur in the relation Actors. The tuples
that are removed from one relation when the difference between two relations
is generated are those that are in the intersection of the two relations.
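Both properties described above — that difference is not commutative, and that the tuples removed are exactly those in the intersection of the two relations — can be checked with Python sets (the sample tuples are invented):

```python
actors  = {("a1", "Helen Drummond"), ("a2", "Chris Jones")}
singers = {("a1", "Helen Drummond"), ("s1", "Swee Hor Tan")}

only_actors  = actors - singers          # Actors difference Singers
only_singers = singers - actors          # Singers difference Actors

# Difference is not commutative...
assert only_actors != only_singers
# ...and the tuples removed are exactly those in the intersection.
assert only_actors == actors - (actors & singers)
```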
Cartesian product
If a relation called Relation-A has a certain number of tuples (call this number
N), this can be represented as NA (meaning the number of tuples in Relation-A).
Similarly, Relation-B may have a different number of tuples (call this number
M), which can be shown as MB (meaning the number of tuples in Relation-B).
The resulting relation from the operation ‘Relation-A Cartesian product
Relation-B’ forms a new relation containing NA * MB tuples (meaning the
number of tuples in Relation-A times the number of tuples in Relation-B).
Each tuple in the new relation created as a result of this operation will consist
of each tuple from Relation-A paired with each tuple from Relation-B, which
includes all possible combinations.
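The NA * MB rule can be checked with a small sketch using itertools.product (relation contents are invented):

```python
from itertools import product

relation_a = {("t1",), ("t2",), ("t3",)}   # NA = 3 tuples
relation_b = {("x",), ("y",)}              # MB = 2 tuples

# Relation-A Cartesian product Relation-B: every pairing of tuples.
cart = {a + b for a, b in product(relation_a, relation_b)}
print(len(cart))                           # NA * MB = 6
```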
In order to create a new relation which pairs each singer with each role, we need
to use the relational operation Cartesian product.
Relational Algebra operation: Singers Cartesian product Roles giving Singers-
Roles
The result of Singers Cartesian product Roles gives a new relation Singers-Roles,
showing each tuple from one relation with each tuple of the other, producing the
tuples in the new relation. Each singer is associated with all roles; this produces
a relation with 16 tuples, as there were 8 tuples in the relation Singers, and 2
tuples in the relation Roles.
What do you think would be the result of the following?
Relational Algebra operation: Singers Cartesian product Singers giving Singers-
Extra
Relational Algebra operation: Actors Cartesian product Roles giving Actors-
Roles
Division
If, for example:
Value-A = 2
Value-B = 3
Value-C = 6
then: Value-C is the result of Value-A times Value-B (i.e. 6 = 2 * 3), and:
Value-C divided by Value-A gives Value-B as a result (i.e. 6/2 = 3), and:
Value-C divided by Value-B gives Value-A as a result (i.e. 6/3 = 2).
In Relational Algebra:
If:
Relation-A = Singers
Relation-B = Roles
Relation-C = Singers-Roles
then: Relation-C is the result of Relation-A Cartesian product Relation-B and:
Relation-C divided by Relation-A gives Relation-B as a result
and: Relation-C divided by Relation-B gives Relation-A as a result.
If we start with two relations, Singers and Roles, we can create a new relation
Singers-Roles by performing the Cartesian product of Singers and Roles. This
new relation shows every role in turn with every singer.
The new relation Singers-Roles has a special relationship with the relations
Singers and Roles from which it was created, as will be demonstrated below.
We can see that the attributes of the relation Singers are a subset of the at-
tributes of the relation Singers-Roles. Similarly, the attributes of the relation
Roles are also a subset (although a different subset) of the relation Singers-Roles.
If we now divide the relation Singers-Roles by the relation Roles, the resulting
relation will be the same as the relation Singers.
Relational Algebra operation: Singers-Roles divide by Roles giving Our-Singers
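Division can be sketched as 'keep the tuples that are paired with every tuple of the divisor'. The helper below is an illustrative implementation (the split between quotient and divisor attributes is passed in explicitly; the sample data is invented), and the final assertion checks the property just described — dividing a Cartesian product by one operand recovers the other:

```python
from itertools import product

def divide(dividend, divisor, divisor_width):
    """Tuples q such that q combined with EVERY divisor tuple appears
    in the dividend.  The last divisor_width fields of each dividend
    tuple are assumed to correspond to the divisor's attributes."""
    candidates = {t[:-divisor_width] for t in dividend}
    return {q for q in candidates
            if all(q + d in dividend for d in divisor)}

singers = {("s1", "Helen Drummond"), ("s2", "Swee Hor Tan")}
roles   = {("r1", "Lead"), ("r2", "Chorus")}

# Singers Cartesian product Roles giving Singers-Roles
singers_roles = {s + r for s, r in product(singers, roles)}

# Singers-Roles divide by Roles giving Our-Singers
assert divide(singers_roles, roles, 2) == singers
```

If a singer were paired with only one of the two roles, that singer would drop out of the quotient — the behaviour exploited in the favourite-songs examples that follow.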
Similarly, if we divide the relation Singers-Roles by the relation Singers, the
relation that results from this will be the same as the original Roles relation.
Relational Algebra operation: Singers-Roles divide by Singers giving Our-Roles
Note that there are only two tuples in this relation, although the attributes
role-id and role-name appeared eight times in the relation Singers-Roles, as each
role was associated with each singer in turn.
In the case where not all tuples in one relation have corresponding tuples in the
‘dividing’ relation, the resulting relation will only contain those tuples which
are represented in both the ‘dividing’ and ‘divided’ relations. In such a case it
would not be possible to recreate the ‘divided’ relation from a Cartesian product
of the ‘dividing’ and resulting relations. The next example demonstrates this.
Consider the relation Recordings shown below, which holds details of the songs
recorded by each of the singers.
Three individuals, Chris, Mel and Sam, have each created two new relations
listing their favourite songs and their favourite singers. The use of the division
operation will enable Chris, Mel and Sam to find out which singers have recorded
their favourite songs, and also which songs their favourite singers have recorded.
The table above contains Chris’s favourite songs. In order to find out which
singers have made a recording of these songs, we need to divide this relation
into the Recordings relation. The result of this Relational Algebra operation is
a new relation containing the details of singers who have recorded all of Chris’s
favourite songs. Singers who have recorded some, but not all, of Chris’s favourite
songs are not included.
Relational Algebra operation: Recordings divide by Chris-Favourite-Songs giving Chris-Singers-Songs
Note that the singers in this relation are not the same as those in the relation
Chris-Favourite-Singers. The reason for this is that Chris’s favourite singers
have not all recorded Chris’s favourite songs.
The next relation shows Chris’s favourite singers. Chris wants to know which
songs these singers have recorded. If we divide the Recordings relation by this
relation, we will get a new relation that contains the songs that all these singers
have recorded; songs that have been recorded by some, but not all, of the singers
will not be included in the new relation.
The singers in this relation are not the same as those who sing Chris’s favourite
songs. The reason for this is that not all of Chris’s favourite singers have
recorded these songs.
In order to discover the songs recorded by Chris’s favourite singers, we need to
divide the relation Recordings by Chris-Favourite-Singers.
Relational Algebra operation: Recordings divide by Chris-Favourite-Singers giving Chris-Songs-by-Singers
The new relation created by this operation will provide us with the information
about which of Chris’s favourite singers has recorded all of Chris’s chosen songs.
Any singer who has recorded some, but not all, of Chris’s favourite songs will
be excluded from this new relation.
We can see that the relation created to identify the songs recorded by Chris’s
favourite singers is not the same as Chris’s list of favourite songs, because these
singers have not all recorded the songs listed as Chris’s favourites.
In this example, we have been able to generate two new relations by dividing
into the Recordings relation. These two new relations do not correspond with
the other two relations that were divided into the relation Recordings, because
there is no direct match. This means that we could not recreate the Record-
ings relation by performing a Cartesian product operation on the two relations
containing Chris’s favourite songs and singers.
In the next example, we will identify the songs recorded by Mel’s favourite
singers, and which singers have recorded Mel’s favourite songs. In common
with Chris’s choice, we will find that the singers and the songs do not match
as not all singers have recorded all songs. If all singers had recorded all songs,
the relation Recordings would be the result of Singers Cartesian product Songs,
but this is not the case.
Mel’s favourite songs include all those that have been recorded, but not all
singers have recorded all songs. Mel wants to find out who has recorded these
songs, but the result will only include those singers who have recorded all the
songs.
There are no other songs that have been recorded from the list available; Mel
has indicated that all of these are favourites.
The relational operation below will produce a new relation, Mel-Singers-Songs,
which will contain details of those singers who have recorded all of Mel’s
favourite songs.
Relational Algebra operation: Recordings divide by Mel-Favourite-Songs giving
Mel-Singers-Songs
We can see from this relation, and the one below, that none of Mel’s favourite
singers has recorded all of the songs selected as Mel’s favourites.
We can use the relation Mel-Favourite-Singers to find out which songs have been
recorded by both these performers.
Relational Algebra operation: Recordings divide by Mel-Favourite-Singers giv-
ing Mel-Songs-by-Singers
There is only one song that has been recorded by the singers Mel has chosen, as
shown in the relation below.
We can now turn our attention to Sam’s selection of songs and singers. It
happens that Sam’s favourite song is the same one that Mel’s favourite singers
have recorded.
If we now perform the reverse query to find out who has recorded this song, we
find that there are more singers who have recorded this song than the two who
were Mel’s favourites. The reason for this difference is that Mel’s query was
to find out which song had been recorded by particular singers. This contrasts
with Sam’s search for any singer who has recorded this song.
Relational Algebra operation: Recordings divide by Sam-Favourite-Songs giving
Sam-Singer-Songs
Here we can see that Mel’s favourite singers include the performers who have
recorded Sam’s favourite song, but there are many other singers who have also
made a recording of the same song. Indeed, there are only two singers who have
not recorded this song (Desmond Venables and Swee Hor Tan). This could be
considered unfortunate for Sam, as these are the only two singers named as
Sam’s favourites.
We know that the only two singers who have not recorded Sam’s favourite song
are in fact Sam’s favourite singers. It is now our task to discover which songs
these two singers have recorded.
Relational Algebra operation: Recordings divide by Sam-Favourite-Singers giving
Sam-Songs-by-Singers
This operation creates a new relation that reveals the identity of the song that
has been recorded by all of Sam’s favourite singers.
There is only one song that these two singers have recorded. Indeed, Swee Hor
Tan has only recorded this song, and therefore whatever other songs had been
recorded by Desmond Venables, this song is the only one that fulfils the criteria
of being recorded by both these performers.
Join
Join forms a new relation with all tuples from two relations that meet a condition.
The relations might happen to be union compatible, but they do not have to
be.
The following two relations have a conceptual link, as the stationery orders have
been made by some of the singers. Invoices can now be generated for each singer
who placed an order. (Note that we would not wish to use Cartesian product
here, as not all singers have placed an order, and not all orders are the same.)
The relation Orders identifies the stationery items (from the Stationery relation)
that have been requested, and shows which customer ordered each item (here
the customer-id matches the singer-id).
The Singers relation contains the names and addresses of all singers (who are
the customers), allowing invoices to be prepared by matching the customers who
have placed orders with individuals in the Singers relation.
The attribute which links together the two relations (the identification number)
occurs in both original relations, and thus is found twice in the resulting new
relation; the additional occurrence can be removed by means of a project oper-
ation. A version of the join operation, in which such a project is assumed to
occur automatically, is known as a natural join.
The types of join operation that we have used so far, and that are in fact by far
the most commonly used, are called equi-joins. This is because the two attributes
to be compared in the process of evaluating the join operation are compared for
equality with one another. It is possible, however, to have variations on the join
operation using operators other than equality. Therefore it is possible to have
a GREATER THAN (>) JOIN, or a LESS THAN (<) JOIN.
It would have been possible to create the relation Invoices by producing the
Cartesian product of Singers and Orders, and then selecting only those tuples
where the Singer-id attribute from Singers and the Customer-id attribute from
Orders has the same value. The join operation enables two relations which
are not union compatible to be linked together to form a new relation directly,
without first generating a full Cartesian product and then extracting only those
tuples which are required.
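An equi-join can be sketched in Python as a nested loop that keeps only the pairings satisfying the equality condition — logically a Cartesian product followed by a restrict, but without materialising the full product (the sample rows are invented):

```python
singers = [
    {"singer_id": 1, "name": "Helen Drummond"},
    {"singer_id": 2, "name": "Swee Hor Tan"},
]
orders = [
    {"customer_id": 1, "item": "envelopes"},
    {"customer_id": 1, "item": "pens"},
]

# Equi-join: keep pairings where singer_id equals customer_id.
invoices = [{**s, **o} for s in singers for o in orders
            if s["singer_id"] == o["customer_id"]]

print(len(invoices))   # only Helen Drummond placed orders
```

A natural join would additionally project away the duplicated linking attribute, keeping only one copy of the identification number in the result.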
Activities
Activity 2: Relational Algebra II
Describe, using examples, the characteristics of an equi-join and a natural join.
Consider a relation A with two attributes (attribute X and attribute Y), and a relation B with only one attribute (attribute Y). Assume that attribute Y of relation A and the attribute of relation B are defined on a common domain. What would be the result of A divided by B if:
1. B consists of attribute Y containing the single value C1
2. B consists of attribute Y containing the two values C2 and C3
Review questions
4. What happens if you test two attributes, each of which contains the value
null, to find out if they are equal?
5. What is the Entity Integrity Rule?
6. How are tables linked together in the Relational model?
7. What is Relational Algebra?
8. Explain the concept of union compatibility.
9. Describe the operation of the Relational Algebra operators RESTRICT,
PROJECT, JOIN and DIVIDE.
Discussion topics
1. Now that you have been introduced to the structure of the Relational
model, and having seen important mechanisms such as primary keys, do-
mains, foreign keys and the use of null values, discuss what you feel at
this point to be the strengths and weaknesses of the model. Bear in mind
that, although the Relational Algebra is a part of the Relational model,
it is not generally the language used for manipulating data in commer-
cial database products. That language is SQL, which will be covered in
subsequent chapters.
2. Consider the operations of Relational Algebra. Why do you think Rela-
tional Algebra is not used as a general approach to querying and manip-
ulating data in Relational databases? Given that it is not used as such,
what value can you see in the availability of a language for manipulating
data which is not specific to any one developer of database systems?
Chapter 3. Introduction to SQL
Table of contents
• Objectives
• Introduction to SQL
• Context
• SQL overview
• The example company database
– The EMP table
– The DEPT table
– The data contained in the EMP and DEPT tables
• SQL SELECT statement
– Simple example queries
– Calculating values and naming query columns
∗ Altering the column headings of query results
• The WHERE clause
– Basic syntax of the WHERE clause
– Examples of using the WHERE clause
– The use of NOT
– The use of !=
– Retrieving from a list of values
– Querying over a range of values
– Searching for partial matches
• Sorting data
– Descending order
– A sort within a sort
• Handling NULL values in query results (the NVL function)
– WHERE clauses using IS NULL and IS NOT NULL
– The NVL function
• REFERENCE MATERIAL: SQL functions
– Arithmetic functions
– Character functions
– Date functions
– Aggregate functions
• Activity - EMPLOYEE AND DEPARTMENT QUERIES
• Review questions
• Discussion topics
Objectives
• Use string, arithmetic, date and aggregate functions to perform various
calculations or alter the format of the data to be displayed.
• Sort the results of queries into ascending or descending order.
• Understand the significance of NULL entries and be able to write queries
that deal with them.
Introduction to SQL
In parallel with this chapter, you should read Chapter 5 of Thomas Connolly
and Carolyn Begg, “Database Systems A Practical Approach to Design, Imple-
mentation, and Management”, (5th edn.).
This chapter introduces the fundamentals of the Structured Query Language,
SQL, which is a worldwide standard language for the querying and manipulation
of Relational databases. This chapter covers the basic concepts of the language,
and sufficient information for you to write simple but powerful queries. The fur-
ther chapters on the SQL language will build on this knowledge, covering more
complex aspects of the query language and introducing statements for adding,
changing and removing data and the tables used to contain data. The mate-
rial you will cover in the SQL chapters provides you with a truly transferable
skill, as the language constructs you will learn will work in virtually all cases,
unchanged, across a wide range of Relational systems.
Context
This unit presents the basics of the SQL language, and together with the succeed-
ing units on SQL, provides a detailed introduction to the SQL language. The
unit relates to the information covered on Relational Algebra, in that it provides
a practical example of how the operations of the algebra can be made available
within a higher level, non-procedural language. This chapter also closely relates
to the material we will later cover briefly on query optimisation in a chapter
called Database Administration and Tuning, as it provides the basic concepts
needed to understand the syntax of the language, which is the information on
which the query optimisation software operates.
There are a number of SQL implementations available, including Microsoft Access (part of the Office suite), Microsoft SQL Server and Oracle. There are also open-source implementations such as MySQL. You should make sure you have an SQL implementation installed for this chapter. Consult the course website for more information about the recommended and/or compatible SQL implementations.
Although SQL commands in these notes are written in generic terms, you should be mindful that SQL implementations differ, and sometimes what is given here may not work, or will only work with slight modification. Should a command not work with your SQL implementation, consult your software's documentation for that command.
SQL overview
SQL is a language that has been developed specifically for querying and ma-
nipulating data in database systems. Its facilities reflect this fact; for example,
it is very good for querying and altering sets of database records collectively
in one statement (this is known as set-level processing). On the other hand, it
lacks some features commonly found in general programming languages, such
as LOOP and IF…THEN…ELSE statements.
SQL stands for Structured Query Language, and indeed it does have a structure,
and is good for writing queries. However, it is structured rather differently to
most traditional programming languages, and it can be used to update informa-
tion as well as for writing queries.
SQL, as supported in most database systems, is provided via a command-line
interface or some sort of graphical interface that allows for the text-based entry
of SQL statements. For example, the following SQL statement is a query that
will list the names of departments from a database table (also known as a
relation) called DEPT:
SELECT DNAME FROM DEPT;
The SQL language consists of three major components:
• DDL (data definition language): Used to define the way in which
data is stored.
• DML (data manipulation language): Allows retrieval, insertion of
data, etc. (This is sometimes called the ‘query’ language.)
• DCL (data control language): Used to control access to the data. For
example, granting access to a user to insert data in a particular table.
The query language (DML) is very flexible in that it can be used to express
quite complicated queries, sometimes very concisely.
One initial problem that people just starting to learn the language encounter is
that it can sometimes be difficult to tell how hard a query will be to express in
SQL from its natural language specification. That is, some queries that sound
as though they will be hard to code in SQL from their description in a natural
language such as English, turn out to be very straightforward. However, some
simple-sounding queries turn out to be surprisingly difficult.
As you work through the SQL chapters in this module, you will build up expe-
rience and knowledge of the kinds of queries that are straightforward to write
in SQL.
The data manipulation language (DML) of SQL allows the retrieval, insertion, updating and removal of rows stored in relational tables. As mentioned above, any number of rows can be altered in a single statement, which makes the DML a very powerful tool.
The data definition language (DDL) is used to create, change the structure of or
remove whole tables and other relational structures. So whereas you would use
the INSERT statement of the DML to insert new rows into an existing table,
you would use the DDL CREATE TABLE statement to establish a new table
in the first place.
The data control language (DCL) defines activities that fall outside the categories of the DDL and DML, such as granting privileges to users, and defining when proposed changes to a database should be irrevocably made.
Throughout this and the succeeding chapters on SQL, we are going to use a
standard pair of tables and set of data on which to write SQL statements. This
standard data set comprises the tables EMP and DEPT. The structure of each
is first described, and then the example records for each are presented.
The EMP table stores records about company employees. This table defines and
contains the values for the attributes EMPNO, ENAME, JOB, MGR, HIRE-
DATE, SAL, COMM and DEPTNO.
• EMPNO is a unique employee number; it is the primary key of the em-
ployee table.
• ENAME stores the employee’s name.
• The JOB attribute stores the name of the job the employee does.
• The MGR attribute contains the employee number of the employee who
manages that employee. If the employee has no manager, then the MGR
column for that employee is left set to null.
• The HIREDATE column stores the date on which the employee joined the
company.
• The SAL column contains the details of employee salaries.
• The COMM attribute stores values of commission paid to employees. Not
all employees receive commission, in which case the COMM field is set to
null.
• The DEPTNO column stores the department number of the department in which each employee is based. This data item acts as a foreign key, linking the employee details stored in the EMP table with the details of departments in which employees work, which are stored in the DEPT table.
The DEPT table stores records about the different departments that employees
work in. This table defines and contains the values for the attributes as follows:
• DEPTNO: The primary key containing the department numbers used to
identify each department.
• DNAME: The name of each department.
• LOC: The location where each department is based.
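The two structures just described can be sketched in SQL DDL (the CREATE TABLE statement itself is covered in a later chapter). The column types below are assumptions for illustration, as is the single sample row taken from the classic demonstration data set; SQLite, driven from Python, is used here only as a convenient test bed.

```python
import sqlite3

# A sketch of the two tables; real course tables may use different types.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DEPT (
        DEPTNO INTEGER PRIMARY KEY,  -- department number
        DNAME  TEXT,                 -- department name
        LOC    TEXT                  -- location
    );
    CREATE TABLE EMP (
        EMPNO    INTEGER PRIMARY KEY,             -- unique employee number
        ENAME    TEXT,                            -- employee name
        JOB      TEXT,                            -- job title
        MGR      INTEGER,                         -- manager's EMPNO (null if none)
        HIREDATE TEXT,                            -- date employee joined
        SAL      NUMERIC,                         -- monthly salary
        COMM     NUMERIC,                         -- commission (null if none)
        DEPTNO   INTEGER REFERENCES DEPT(DEPTNO)  -- foreign key to DEPT
    );
    INSERT INTO DEPT VALUES (10, 'ACCOUNTING', 'NEW YORK');
    INSERT INTO EMP VALUES (7839, 'KING', 'PRESIDENT', NULL,
                            '1981-11-17', 5000, NULL, 10);
""")
row = conn.execute("SELECT ENAME, DEPTNO FROM EMP").fetchone()
print(row)
```

The REFERENCES clause records the foreign-key link from EMP.DEPTNO to DEPT.DEPTNO described above.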
SQL SELECT statement
SQL queries can be written in upper or lower case, and on one or more lines.
All queries in SQL begin with the word SELECT. The most basic form of the
SELECT statement is as follows:
SELECT <select-list> FROM <table-list>
It is often useful to separate the different parts of a query onto different lines,
so we might write this again as:
SELECT <select-list>
FROM <table-list>
Following the SELECT keyword is the list of table columns that the user wishes
to view. This list is known as the select-list. As well as listing the table columns
to be retrieved by the query, the select-list can also contain various SQL func-
tions to process the data; for example, to carry out calculations on it. The
select-list can also be used to specify headings to be displayed above the data
values retrieved by the query. Multiple select-list items are separated from each
other with commas. The select-list allows you to filter out the columns you
don’t want to see in the results.
The FROM keyword is, like the SELECT keyword, mandatory. It effectively
terminates the select-list, and is followed by the list of tables to be used by
the query to retrieve data. This list is known as the table-list. The fact that
the tables need to be specified in the table-list means that, in order to retrieve
data in SQL, you do need to know in which tables data items are stored. This
may not seem surprising from the perspective of a programmer, or database
developer, but what about an end-user? SQL has, in some circles, been put
forward as a language that can be learned and used effectively by business users.
We can see even at this early stage, however, that a knowledge of what data is
stored where, at least at the logical level, is fundamental to the effective use of
the language.
Exercise 1 - Fundamentals of SQL query statements
1. What keyword do all SQL query statements begin with?
2. What is the general form of simple SQL query statements?
As you can see, the query returns a row for each record in the table, each row containing a single column presenting the name of the employee (i.e. the value of the ENAME attribute for each EMP record).
Sample query 2 - all data (rows and columns) from the DEPT table
There are two usual ways to list all data in a table. The simplest is to use a
shorthand notation provided in SQL to list all the columns in any table. This
is done simply by specifying an asterisk ‘*’ for the select-list as follows:
SELECT *
FROM DEPT;
The asterisk is called a wild card, and causes all attributes of the specified table
to be retrieved by the query.
Note that as it is the details of the DEPT table we wish to view, it is the DEPT
table this time that appears in the table-list following the FROM keyword.
The use of * in this way is a very easy way to view the entire contents of any
table. The alternative approach is simply to list all of the columns of the DEPT
table in the select-list as follows:
SELECT DEPTNO, DNAME, LOC
FROM DEPT;
The result of executing either of these queries on our DEPT table at this time
is the following:
A potential problem of using the asterisk wild card, instead of explicitly listing all the attributes we want, is that the behaviour of the query will change if the table structure is altered. For example, if we add new attributes to the DEPT table, the SELECT * version of the query will then list the new attributes. This is a strong motivation for avoiding the use of the asterisk wild card in most situations.
Sample query 3 - the salary and commission of all employees
If we wish to see details of each employee’s salary and commission we would use
the following query that specifies just those attributes we desire:
SELECT EMPNO, ENAME, SAL, COMM
FROM EMP;
In this example, we have included the EMPNO column, just in case we had any
duplicate names among the employees.
The result of this query is:
Calculating values and naming query columns
A query that retrieves the number and name of each employee, and calculates
their annual income, is as follows:
SELECT EMPNO, ENAME, 12 * (SAL + COMM)
FROM EMP;
The calculation here adds the monthly commission to the salary, and then mul-
tiplies the result by 12 to obtain the total annual income.
Notice that only records for which the commission value was not NULL have been included. This issue is discussed later in the chapter. In some SQL implementations, such as MS Access, you may have to explicitly request records with NULL values to be excluded. So the above SQL query:
SELECT EMPNO, ENAME, 12 * (SAL + COMM)
FROM EMP;
may need to be written as:
SELECT EMPNO, ENAME, 12 * (SAL + COMM)
FROM EMP
WHERE COMM IS NOT NULL;
(See later to understand the WHERE part of this query)
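The behaviour just described stems from the fact that any arithmetic involving NULL yields NULL. The following sketch (SQLite via Python, with two made-up employee rows) shows the NULL propagating into the calculated column, and the IS NOT NULL filter removing the affected row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMP (EMPNO INTEGER, ENAME TEXT, SAL NUMERIC, COMM NUMERIC)")
conn.executemany("INSERT INTO EMP VALUES (?,?,?,?)",
                 [(1, 'ALLEN', 1600, 300), (2, 'KING', 5000, None)])

# Arithmetic involving NULL yields NULL, so KING's annual income is NULL.
rows = conn.execute("SELECT ENAME, 12 * (SAL + COMM) FROM EMP").fetchall()
print(rows)   # KING's computed column is None

# Filtering with IS NOT NULL keeps only employees who receive commission.
paid = conn.execute(
    "SELECT ENAME FROM EMP WHERE COMM IS NOT NULL").fetchall()
print(paid)
```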
Depending on the SQL system on which you run a query like this, the calculated column may or may not have a heading. The column heading may be the expression itself, 12 * (SAL + COMM), or something indicating that an expression has been calculated, such as Expr1004 (these two examples are what happens in Oracle and MS Access respectively). Since such calculations usually mean something
in particular (in this case, total annual income), it makes sense to name these
calculated columns sensibly wherever possible.
Sometimes it is desirable to improve upon the default column headings for query
results supplied by the system, to make the results of queries more intelligible.
For example, the result of a query to calculate annual pay by summing the
monthly salary and commission and multiplying by 12, would by default in some
systems such as Oracle, have the expression of the calculation as the column
heading. The result is more readable, however, if we supply a heading which
clearly states what the compound value actually is, i.e. annual income. To do
this, simply include the required header information, in double quotes, after the
column specification in the select-list. For the annual pay example, this would
be:
SELECT EMPNO, ENAME, 12*(SAL + COMM) "ANNUAL INCOME"
FROM EMP;
The result is more meaningful:
Once again, there are alternative ways to achieve the naming of columns in
some systems including MS Access and MySQL, rather than using the double
quotation marks around the column heading. The use of the keyword AS and
square brackets may also be required.
So the SQL query:
SELECT EMPNO, ENAME, 12*(SAL + COMM) "ANNUAL INCOME"
FROM EMP;
may need to be written as:
SELECT EMPNO, ENAME, 12*(SAL + COMM) AS [ANNUAL INCOME]
FROM EMP WHERE COMM IS NOT NULL;
(See next section to understand the WHERE part of this query)
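Column aliasing can be tried in the same way. In this sketch (SQLite via Python, with one made-up employee row) the double-quote form of the heading is used; SQLite also accepts the square-bracket form mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMP (ENAME TEXT, SAL NUMERIC, COMM NUMERIC)")
conn.execute("INSERT INTO EMP VALUES ('ALLEN', 1600, 300)")

# A heading containing a space must be quoted; here the double-quote
# form is used to name the calculated column.
cur = conn.execute('SELECT 12 * (SAL + COMM) AS "ANNUAL INCOME" FROM EMP')
print(cur.description[0][0])   # the column is now headed ANNUAL INCOME
row = cur.fetchone()
print(row)
```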
The WHERE clause
Very often we wish to filter the records/rows retrieved by a query. That is, we
may only wish to have a subset of the records of a table returned to us by a
query.
The reason for this may be, for example, in order to restrict the employees
shown in a query result just to those employees with a particular job, or with
a particular salary range, etc. Filtering of records is achieved in SQL through
the use of the WHERE clause. In effect, the WHERE clause implements the
functionality of the RESTRICT operator from Relational Algebra, in that it
takes a horizontal subset of the data over which the query is expressed.
The WHERE clause is not mandatory, but when it is used, it must appear
following the table-list in an SQL statement. The clause consists of the keyword
WHERE, followed by one or more restriction conditions, each of which are
separated from one another using the keywords AND or OR.
The format of the basic SQL statement including a WHERE clause is therefore:
SELECT <select-list> FROM <table-list>
[WHERE <condition1> [AND|OR <condition2> ... <conditionN>]]
The number of conditions that can be included within a WHERE clause varies
from DBMS to DBMS, though in most major commercial DBMS, such as Oracle,
Sybase, Db2, etc, the limit is so high that it poses no practical restriction on
query specifications. We can also use parentheses ‘(’ and ‘)’ to nest conditions
or improve legibility.
Note, incidentally, the standard form in which some systems such as Oracle
handle dates: they are enclosed in single quotes, and appear as: DD-MMM-
YYYY (two digits for day ‘dd’, three letters for month ‘mmm’ and four digits
for year ‘yyyy’). In some systems including MS Access, the date should be
enclosed with two hash ‘#’ characters, rather than single quotes - for example,
#01-JAN-1990#. You should check with your system’s documentation for the
requirement as to how the dates should be formatted. Below are the two versions
of the SQL statements, with different formats for dates:
For systems including Oracle:
SELECT EMPNO, ENAME, HIREDATE
FROM EMP
WHERE HIREDATE < '01-MAY-1981';
For systems including MS Access:
SELECT EMPNO, ENAME, HIREDATE
FROM EMP
WHERE HIREDATE < #01-MAY-1981#;
In our example above, we used the < (less than) comparison operator to form the condition in the WHERE clause. In SQL, the following simple comparison operators are available:
= equals
!= is not equal to (allowed in some dialects)
<> is not equal to (ISO standard)
<= is less than or equal to
< is less than
>= is greater than or equal to
> is greater than
WHERE example 2 - two conditions that must both be true
The logical operator AND is used to specify that two conditions must both be
true. When a WHERE clause has more than one condition, this is called a
compound condition.
Suppose we wish to retrieve all salesmen who are paid more than 1500 a month.
This can be achieved by ANDing the two conditions (is a salesman, and is paid
more than 1500 a month) together in a WHERE clause as follows:
SELECT EMPNO, ENAME, JOB, SAL
FROM EMP
WHERE JOB = 'SALESMAN' AND SAL > 1500;
The result of this query is:
Only employees fulfilling both conditions will be returned by the query. Note
the way in which the job is specified in the WHERE clause. This is an example
of querying the value of a field of type character, or as it is called in Oracle, of
type varchar2. When comparing attributes against fixed values of type character
such as SALESMAN, the constant value being compared must be contained in
single quotes, and must be expressed in the same case as it appears in the
table. All of the data in the EMP and DEPT tables is in upper case, so when
we are comparing character values, we must make sure they are in upper case
for them to match the values in the EMP and DEPT tables. In other words,
from a database point of view, the job values of SALESMAN and salesman are
completely different, and if we express a data item in lower case when it is stored
in upper case in the database, no match will be found.
In some systems, including MS Access, the text an attribute is to match should
be enclosed with double quote characters, rather than single quotes. For exam-
ple, “SALESMAN” rather than ‘SALESMAN’:
SELECT EMPNO, ENAME, JOB, SAL
FROM EMP
WHERE JOB = "SALESMAN" AND SAL > 1500;
WHERE example 3 - two conditions, one of which must be true
The logical operator OR is used to specify that at least one of two conditions
must be true.
For example, if we wish to find employees who are based in either department
10 or department 20, we can do it by issuing two conditions in the WHERE
clause as follows:
SELECT EMPNO, ENAME, DEPTNO
FROM EMP
WHERE DEPTNO = 10 OR DEPTNO = 20;
The result of this query is:
The use of NOT
The keyword NOT can be used to negate a condition, i.e. only records that do not meet a condition are selected. An example might be to list all employees who are not salesmen:
SELECT EMPNO, ENAME, JOB, SAL
FROM EMP
WHERE NOT(JOB = 'SALESMAN');
Another example might be to list all employees who do not earn more than
1500:
SELECT EMPNO, ENAME, JOB, SAL
FROM EMP
WHERE NOT(SAL > 1500);
The use of !=
The operator != can be used to select where some value is NOT EQUAL TO
some other value. So another way to write the query:
SELECT EMPNO, ENAME, JOB, SAL
FROM EMP
WHERE NOT(JOB = 'SALESMAN');
is as follows:
SELECT EMPNO, ENAME, JOB, SAL
FROM EMP
WHERE JOB != 'SALESMAN';
Retrieving from a list of values
The IN keyword allows a condition to test whether an attribute's value matches any member of a list of values. Using this syntax, the earlier query that retrieved employees in department 10 or department 20 would be rewritten as follows:
SELECT EMPNO, ENAME, DEPTNO
FROM EMP
WHERE DEPTNO IN (10, 20);
The result of the query is just the same, but in many cases this form of the
WHERE clause is both shorter and simpler to use.
Querying over a range of values
The BETWEEN keyword allows us to test whether a value lies within a given range. For example, to list the employees whose salary lies between 1000 and 2000:
SELECT EMPNO, ENAME, SAL
FROM EMP
WHERE SAL BETWEEN 1000 AND 2000;
Note that the BETWEEN operator is inclusive, so a value of 1000 or 2000 would satisfy the condition and the record would be included in the query result.
An equally valid solution could have been produced by testing whether the
salaries to be returned were >=1000 and <=2000, in which case, the WHERE
clause would have been:
SELECT EMPNO, ENAME, SAL
FROM EMP
WHERE (SAL >=1000) AND (SAL <=2000);
However, this version of the query is longer and more complex, and includes the
need to repeat the SAL attribute for comparison in the second condition of the
WHERE clause.
In general, the solution using BETWEEN is preferable since it is more readable
- it is clearer to a human reading the SQL query code what condition is being
evaluated.
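The inclusiveness of BETWEEN, and its equivalence to the two-condition form, can be checked with a small experiment (SQLite via Python; the names and salary figures are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMP (ENAME TEXT, SAL NUMERIC)")
conn.executemany("INSERT INTO EMP VALUES (?,?)",
                 [('WARD', 999), ('ADAMS', 1000), ('MILLER', 1500),
                  ('FORD', 2000), ('KING', 2001)])

# BETWEEN is inclusive: the boundary salaries 1000 and 2000 both qualify.
between = conn.execute(
    "SELECT ENAME FROM EMP WHERE SAL BETWEEN 1000 AND 2000").fetchall()
compare = conn.execute(
    "SELECT ENAME FROM EMP WHERE SAL >= 1000 AND SAL <= 2000").fetchall()
print(between)            # ADAMS, MILLER and FORD are included
print(between == compare) # the two forms are equivalent
```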
Searching for partial matches
All of the queries we have seen so far have been used to retrieve exact matches from the database. The LIKE keyword allows us to search for items for which we only know a part of the value; a LIKE condition means 'is a partial match with'. The keyword LIKE is used in conjunction with two special characters which act as wild cards - in other words, LIKE expressions can be used to indicate that we do not know precisely what a part of the retrieved value is.
LIKE example - search for words beginning with a certain letter
As an example, we can search for all employees whose names begin with the
letter S as follows:
SELECT EMPNO, ENAME
FROM EMP
WHERE ENAME LIKE 'S%';
This query returns:
Here the percentage sign (%) is used as a wild card, to say that we do not know or do not wish to specify the rest of the value of the ENAME attribute; the only criterion we are specifying is that it begins with 'S', which may be followed by no, one or more than one other character.
The percentage sign can be used at the beginning or end of a character string,
and can be used as a wild card for any number of characters.
The other character that can be used as a wild card is the underline character
(_). This character is used as a wild card for only one character per instance of
the underline character. That is, if we code:
WHERE ENAME LIKE 'S__';
the query will return employees whose names start with S, and have precisely
two further characters after the S. So employees called Sun or Sha would be
returned, but employee names such as Smith or Salt would not be, as they do
not contain exactly three characters.
Note that we can combine conditions using BETWEEN, or LIKE, with other
conditions such as simple tests on salary, etc, by use of the keywords AND and
OR, just as we can combine simple conditions. However, wild card characters
cannot be used to specify members of a list with the IN keyword.
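The behaviour of the two wild cards can be tried as follows (SQLite via Python, with made-up employee names). One caveat: unlike the systems discussed above, SQLite's LIKE is case-insensitive for ASCII letters by default, so all the sample data here is kept in upper case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMP (ENAME TEXT)")
conn.executemany("INSERT INTO EMP VALUES (?)",
                 [('SMITH',), ('SCOTT',), ('SUN',), ('ALLEN',)])

# %: any run of zero or more characters after the S.
starts_with_s = conn.execute(
    "SELECT ENAME FROM EMP WHERE ENAME LIKE 'S%'").fetchall()
print(starts_with_s)   # SMITH, SCOTT and SUN

# _: exactly one character per underscore, so 'S__' needs a 3-letter name.
three_letters = conn.execute(
    "SELECT ENAME FROM EMP WHERE ENAME LIKE 'S__'").fetchall()
print(three_letters)   # only SUN
```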
Sorting data
Data can very easily be sorted into different orders in SQL. We use the ORDER
BY clause. This clause is optional, and when required appears as the last clause
in a query. The ORDER BY keywords are followed by the attribute or attributes
on which the data is to be sorted. If the sort is to be done on more than one
attribute, the attributes are separated by commas.
The general form of an SQL query with the optional WHERE and ORDER BY
clauses looks as follows:
SELECT <select-list> FROM <table-list>
[WHERE <condition1> [AND|OR <condition2> ... <conditionN>]]
[ORDER BY <attribute-list>]
An example would be to sort the departments into department number order:
SELECT DEPTNO, DNAME
FROM DEPT
ORDER BY DEPTNO;
OR
SELECT DEPTNO, DNAME
FROM DEPT
ORDER BY DEPTNO ASC;
Note: SQL provides the keyword ASC to explicitly request ordering in ascending
order.
Descending order
SQL provides the keyword DESC to request sorting in the reverse order. So to
sort the departments into reverse alphabetical order, we can write the following:
SELECT DEPTNO, DNAME
FROM DEPT
ORDER BY DNAME DESC;
A sort within a sort
It is very easy to specify a sort within a sort, i.e. to first sort a set of records
into one order, and then within each group to sort again by another attribute.
For example, the following query will sort employees into department number
order, and within that, into employee name order.
SELECT EMPNO, ENAME, DEPTNO
FROM EMP
ORDER BY DEPTNO, ENAME;
The result of this query is:
As can be seen, the records have been sorted into order of DEPTNO first, and then for each DEPTNO, the records have been sorted alphabetically by ENAME. This is easiest to see where a DEPTNO is repeated: for example, if two employees, WARD and KUDO, belonged to the same department, their two rows would appear together in the result, with KUDO listed above WARD because of the alphabetical ordering on ENAME.
When a query includes an ORDER BY clause, the data is sorted as follows:
• Any null values are grouped together at one end of the sort (whether they appear first or last varies between systems)
• Numbers are sorted into ascending numeric order
• Character data is sorted into alphabetical order
• Dates are sorted into chronological order
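A sort within a sort, and the placement of nulls, can be observed directly. This sketch uses SQLite from Python with made-up rows; note that SQLite happens to place nulls first in an ascending sort, but other systems (Oracle, for example) place them last.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMP (ENAME TEXT, DEPTNO INTEGER)")
conn.executemany("INSERT INTO EMP VALUES (?,?)",
                 [('WARD', 30), ('ALLEN', 30), ('CLARK', 10), ('SMITH', None)])

# Sort by department first, and by name within each department.
# SQLite places the row with a null DEPTNO first; within department 30,
# ALLEN sorts alphabetically before WARD.
rows = conn.execute(
    "SELECT ENAME, DEPTNO FROM EMP ORDER BY DEPTNO, ENAME").fetchall()
print(rows)
```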
We can include an ORDER BY clause with a WHERE clause, as in the following
example, which lists all salesman employees in ascending order of salary:
SELECT EMPNO,ENAME,JOB,SAL
FROM EMP
WHERE JOB = 'SALESMAN'
ORDER BY SAL;
Handling NULL values in query results
WHERE clauses using IS NULL and IS NOT NULL
In the chapter introducing the Relational model, we discussed the fact that
NULL values represent the absence of any actual value, and that it is correct
to refer to an attribute being set to NULL, rather than being equal to NULL.
The syntax of testing for NULL values in a WHERE clause reflects this. Rather
than coding WHERE X = NULL, we write WHERE X IS NULL, or, WHERE
X IS NOT NULL.
For example, to return all employees who do not receive a commission, the query
would be:
SELECT EMPNO, ENAME, SAL
FROM EMP
WHERE COMM IS NULL;
We can also select records that do not have NULL values:
SELECT EMPNO, ENAME, SAL, COMM
FROM EMP
WHERE COMM IS NOT NULL;
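The difference between = NULL and IS NULL is easy to demonstrate (SQLite via Python, with made-up data): the first form never matches, because a comparison with NULL is neither true nor false.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMP (ENAME TEXT, COMM NUMERIC)")
conn.executemany("INSERT INTO EMP VALUES (?,?)",
                 [('ALLEN', 300), ('KING', None)])

# COMM = NULL is never true (comparing with NULL yields unknown),
# so this query returns no rows at all.
eq_null = conn.execute(
    "SELECT ENAME FROM EMP WHERE COMM = NULL").fetchall()
print(eq_null)

# IS NULL is the correct test.
is_null = conn.execute(
    "SELECT ENAME FROM EMP WHERE COMM IS NULL").fetchall()
print(is_null)
```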
The NVL function
The NVL function substitutes a specified value wherever a null is encountered. It takes two arguments: an attribute (or expression) that may be null, and the value to use in its place. For example, NVL(COMM, 0) returns the value of COMM, or 0 where COMM is null. Using NVL, the annual income calculation seen earlier can include employees who receive no commission, by treating their commission as zero:
SELECT EMPNO, ENAME, 12 * (SAL + NVL(COMM, 0)) "ANNUAL INCOME"
FROM EMP;
REFERENCE MATERIAL: SQL functions
SQL functions help simplify different types of operations on the data. SQL
supports four types of functions:
• Arithmetic functions
• Character functions
• Date functions
• Aggregate functions
The functions are used as part of a select-list of a query, or if they refer to a
specific row, they may be used in a WHERE clause. They are used to modify
the values or format of data being retrieved.
Arithmetic functions
• round
• round(number,d) - rounds number to d decimal places (d can be negative)
Example:
round(sal,2) - rounds values of the SAL attribute to two decimal places
• trunc
• trunc(number,d) - truncates number to d decimal places (d can be negative)
Example:
trunc(sal,2) - truncates values of the SAL attribute to two decimal places
Note: The difference between the round and truncate functions is that round will round up digits of five or higher, whilst trunc always rounds down.
• abs
• abs(number) - returns the absolute value of the number
Example:
abs(comm-sal) - returns the absolute value of COMM - SAL; that is, if
the number returned would be negative, the minus sign is discarded
• sign
• sign(number) - returns 1 if number greater than zero, 0 if number = zero,
-1 if number less than zero
Example:
sign(comm-sal) - returns 1 if COMM - SAL > 0, 0 if COMM - SAL = 0,
and - 1 if COMM - SAL < 0
• mod
• mod(number1,number2) - returns the remainder when number1 is divided
by number2
Example:
mod(sal,comm) - returns the remainder when SAL is divided by COMM
• sqrt
• sqrt(number) - returns the square root of the number. If the number is
less than zero then sqrt returns null
Example:
sqrt(sal) - returns the square root of salaries
• to_char
• to_char(number[,picture]) - converts a number to a character string in the format specified
Example:
to_char(sal,'9999.99') - represents salary values with four digits before the decimal point, and two afterwards
• decode
• decode(column,starting-value,substituted-value..) - substitutes alterna-
tive values for a specified column
Example:
decode(comm,100,200,200,300,100) - returns values of commission in-
creased by 100 for values of 100 and 200, and displays any other comm
values as if they were 100
• ceil
• ceil(number) - rounds up a number to the nearest integer
Example:
ceil(sal) - rounds up salaries to the nearest integer
• floor
• floor(number) - rounds down a number to the nearest integer
Example:
floor(sal) - rounds down salary values to the nearest integer
Character functions
• distinct <column> - lists the distinct values of the specific column
Example:
Distinct job - lists all the distinct values of job in the JOB attribute
• length
• length(string) - finds number of characters in the string
Example:
length(ename) - returns the number of characters in values of the ENAME
attribute
• substr
• substr(column,start-position[,length]) - extracts a specified number of
characters from a string
Example:
substr(ename,1,3) - extracts three characters from the ENAME column,
starting from the first character
• instr
• instr(string1,string2[,start-position]) - finds the position of string2 in string1. The square brackets around the start-position argument denote that it is optional
Example:
instr(ename,'S') - finds the position of the character 'S' in values of the ENAME attribute
• upper
• upper(string) - converts all characters in the string to upper case
Example:
upper(ename) - converts values of the ENAME attribute to upper case
• lower
• lower(string) - converts all characters in the string to lower case
Example:
lower(ename) - converts values of the ENAME attribute to lower case
• to_number
• to_number(string) - converts a character string to a number
Example:
to_number(‘11’) + sal - adds the value 11 to employee salaries
• to_date
• to_date(str[,pict]) - converts a character string in a given format to a date
Example:
to_date('14/apr/99','dd/mon/yy') - converts the character string '14/apr/99' to the standard system representation for dates
• soundex
• soundex(string) - converts phonetically similar strings to the same value
Example:
soundex('smith') - converts all values that sound like the name Smith to the same value, enabling the retrieval of phonetically similar attribute values
• vsize
• vsize(string) - finds the number of bytes required to store the character
string
Example:
vsize(ename) - returns the number of bytes required to store values of the
ENAME attribute
• lpad
• lpad(string,len[,char]) - left pads the string with filler characters
Example:
lpad(ename,10) - left pads values of the ENAME attribute with filler char-
acters (spaces)
• rpad
• rpad(string,len[,char]) - right pads the string with filler characters
Example:
rpad(ename,10) - right pads values of the ENAME attribute with filler
characters (spaces)
• initcap
• initcap(string) - capitalises the initial letter of every word in a string
Example:
initcap(job) - starts all values of the JOB attribute with a capital letter
• translate
• translate(string,from,to) - replaces each occurrence of a character in the
‘from’ string with the corresponding character in the ‘to’ string
Example:
translate(ename,‘ABC’,‘XYZ’) - in values of the ENAME attribute, replaces
every ‘A’ with ‘X’, every ‘B’ with ‘Y’ and every ‘C’ with ‘Z’
• ltrim
• ltrim(string,set) - trims all characters in a set from the left of the string
Example:
ltrim(ename,‘ ’) - removes all spaces from the start of values of the ENAME
attribute
• rtrim
• rtrim(string,set) - trims all characters in the set from the right of the string
Example:
rtrim(job,‘.’) - removes any full-stop characters from the right-hand side
of values of the JOB attribute
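The character functions can be nested inside one another, the result of the inner call becoming the argument of the outer one. A sketch:

SELECT INITCAP(LOWER(ENAME)), RPAD(SUBSTR(JOB,1,3),10,'.')
FROM EMP;

The first column displays names such as SMITH as Smith; the second truncates each job title to its first three characters and right-pads the result to ten characters with full stops.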
Date functions
The date functions in most commercially available database systems are quite
rich, reflecting the fact that many commercial applications are very date driven.
The most commonly used date functions in SQL are as follows:
• sysdate - returns the current date
• add_months(date, number) - adds a number of months to a date (number
can be negative). For example: add_months(hiredate, 3). This adds three
months to each value of the HIREDATE attribute
• months_between(date1, date2) - subtracts date2 from date1 to yield the
difference in months. For example: months_between(sysdate, hiredate).
This returns the number of months between the current date and the dates
employees were hired
• last_day(date) - moves a date forward to the last day of its month. For
example: last_day(hiredate). This moves each hiredate forward to the last
day of the month in which it occurred
• next_day(date, day) - moves a date forward to the next occurrence of the
given day of the week. For example: next_day(hiredate,‘monday’). This
returns each hiredate moved forward to the Monday following it
• round(date[,precision]) - rounds a date to a specified precision. For
example: round(hiredate,‘month’). This displays hiredates rounded to
the nearest month
• trunc(date[,precision]) - truncates a date to a specified precision. For
example: trunc(hiredate,‘month’). This displays hiredates truncated to
the start of their month
• decode(column,search-value,substituted-value[,…][,default]) - substitutes
alternative values for specified values of a column. For example:
decode(hiredate,‘25-dec-99’,‘christmas day’,hiredate). This displays any
hiredates of the 25th of December, 1999, as ‘christmas day’, and all other
hiredates unchanged
• to_char(date[,picture]) - outputs the date in the specified character
format. The format picture can be any combination of the formats
shown below. The whole format picture must be enclosed in single
quotes. Punctuation may be included in the picture where required.
Any text should be enclosed in double quotes. The default date format
is ‘dd-mon-yy’. The numeric formats are listed below (format,
description, example):
• cc, century, 20
• y,yyy, year, 1,986
• yyyy, year, 1986
• yyy, last three digits of year, 986
• yy, last two digits of year, 86
• y, last digit of year, 6
• q, quarter of year, 2
• ww, week of year, 15
• w, week of month, 2
• mm, month, 04
• ddd, day of year, 102
• dd, day of month, 14
• d, day of week, 7
• hh or hh12, hour of day (01-12), 02
• hh24, hour of day (00-23), 14
• mi, minutes, 10
• ss, seconds, 5
• sssss, seconds past midnight, 50465
• j, julian calendar day, 2446541
The following suffixes may be appended to any of the numeric formats (suffix,
meaning, example):
• th, st, nd, rd, after the number, 14th
• sp, spells the number, fourteen
• spth/st/nd/rd, spells the number, fourteenth
There is also a set of character format pictures (character format, meaning,
example):
• year, year, nineteen-eighty-six
• month, name of month, april
• mon, abbreviated month, apr
• day, day of week, saturday
• dy, abbreviated day, sat
• am or pm, meridian indicator, pm
• a.m. or p.m., meridian indicator, p.m.
• bc or ad, year indicator, ad
• b.c. or a.d., year indicator, a.d.
If you enter a date format in upper case, the actual value will be output in upper
case. If the format is lower case, the value will be output in lower case. If the
first character of the format is upper case and the rest lower case, the value will
be output similarly.
For example:
to_char(hiredate,'dd/mon/yyyy')
to_char(hiredate,'day,"the"Ddspth"of"month')
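Putting several of the date functions together, a sketch (assuming the Oracle-style function names used above):

SELECT ENAME,
HIREDATE,
LAST_DAY(HIREDATE),
ADD_MONTHS(HIREDATE,6),
TO_CHAR(HIREDATE,'Day "the" Ddspth "of" Month yyyy')
FROM EMP;

This lists each employee's hiredate, the last day of the hire month, the date six months after hiring, and the hiredate spelled out in words.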
Aggregate functions
• avg
• avg(column) - computes the average of all the values in the specified column
and ignores null values
Example:
avg(sal) - gives the average salary in the employee table, which is 2073.21
• Sum
• Sum(column) - computes the total of all the values in the specified column
and ignores null values
Example:
sum(comm) - calculates the total commission paid to all employees
• min
• min(column) - finds the minimum value in a column
Example:
min(sal) - returns the lowest salary
• max
• max(column) - finds the maximum value in a column
Example:
max(comm) - returns the highest commission
• count
• count(column) - counts the number of values and ignores nulls
Example:
count(empno) - counts the number of employees
• variance
• variance(column) - returns the variance of the group and ignores nulls
Example:
variance(sal) - returns the variance of all the salary values
• Stddev
• Stddev(column) - returns the standard deviation of a set of numbers (same
as square root of variance)
Example:
stddev(comm) - returns the standard deviation of all commission values
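Because each aggregate function reduces a column to a single value, several of them can appear together in one select-list. A sketch summarising the salary column in one query:

SELECT COUNT(EMPNO), MIN(SAL), MAX(SAL), SUM(SAL), ROUND(AVG(SAL),2)
FROM EMP;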
Using the EMP and DEPT tables, create the following queries in SQL, and test
them to ensure they are retrieving the correct data.
You may wish to review the attributes of the EMP and DEPT tables, which are
shown along with the data near the start of the section called Introduction to
the SQL language.
1. List all employees in the order they were hired to the company.
2. Calculate the sum of all the salaries of managers.
3. List the employee numbers, names and hiredates of all employees who
were hired in 1982.
4. Count the number of different jobs in the EMP table without listing them.
5. Find the average commission, counting only those employees who receive
a commission.
6. Find the average commission, counting employees who do not receive a
commission as if they received a commission of 0.
7. Find in which city the Operations department is located.
8. What is the salary paid to the lowest-paid employee?
9. Find the total annual pay for Ward.
10. List all employees with no manager.
11. List all employees who are not managers.
12. How many characters are in the longest department name?
Review questions
Discussion topics
Chapter 4. Intermediate SQL
Table of contents
• Objectives
• Introduction
• Context
• Grouping and summarising information
– A very common error with GROUP BY
– The HAVING clause
• Writing queries on more than one table - JOIN
– Avoiding ambiguously named columns
– Outer JOINs
∗ LEFT OUTER JOIN
∗ RIGHT OUTER JOIN
∗ FULL OUTER JOIN
– Using table aliases
– SELF JOINS
– Summary of JOINs
• Nested queries
– The depth and breadth of nested queries
• The UNION operator
• The INTERSECT operator
• The MINUS operator
• ANY or ALL operator
• Correlated sub-queries
• Interactive queries
• Activities
– Activity 1: JOINs
– Activity 2: GROUP BY
– Activity 3: Nested queries
• Review questions
• Discussion topic
• Additional content
Objectives
• Combine the results of multiple queries in various ways.
Introduction
In parallel with this chapter, you should read Chapter 5 of Thomas Connolly
and Carolyn Begg, “Database Systems A Practical Approach to Design, Imple-
mentation, and Management”, (5th edn.).
This chapter builds on the foundations laid in the previous chapter, which in-
troduced the SQL language. We examine a range of facilities for writing more
advanced queries of SQL databases, including queries on more than one table,
summarising data, and combining the results of multiple queries in various ways.
Context
This chapter forms the bridge between the chapter in which the SQL language
was introduced, and the coverage of the data definition language (DDL) and
data control language (DCL) provided in the next chapter called Advanced
SQL.
It is possible to express a range of complex queries using the data manipulation
language (DML) previously introduced. The earlier chapter showed how fairly
simple queries can be constructed using the select-list, the WHERE clause to
filter rows out of a query result, and the ORDER BY clause to sort information.
This chapter completes the coverage of the DML facilities of SQL, and will
considerably increase the range of queries you are able to write. The final SQL
chapter will then address aspects of SQL relating to the updating of data and
the manipulation of the logical structures, i.e. tables that contain data.
Grouping and summarising information
Information retrieved from an SQL query can very easily be placed into separate
groups or categories by use of the GROUP BY clause. The clause is similar in
format to ORDER BY, in that the specification of the words GROUP BY is
followed by the data item or items to be used for forming the groups. The
GROUP BY is optional. If it appears in the query, it must appear before the
ORDER BY if the ORDER BY is present.
Example: count the number of employees in each department
To answer this question, it is necessary to place the employees in the EMP
table into separate categories, one for each department. This can be done easily
enough through the use of the DEPTNO column in the EMP table as follows
(with the select-list temporarily omitted):
SELECT ….
FROM EMP
GROUP BY DEPTNO
As far as counting the employees is concerned, this is an example of something
that is seen very commonly with the GROUP BY clause; that is, the use of
an aggregation function, in this case the COUNT function, in conjunction with
GROUP BY. To complete the query, we simply need to include on the select-
list the DEPTNO column, so that we can see what we are grouping by, and the
COUNT function. The query then becomes:
SELECT DEPTNO,COUNT(EMPNO)
FROM EMP
GROUP BY DEPTNO;
Comments: The query works in two steps. The first step is to group all
employees by DEPTNO. The second step is to count the number of employees
in each group. Of course, headings could be used to improve the clarity of the
results; for example, specifying that the second column is “no. of employees”.
COUNT(EMPNO) AS "no. of employees"
We have specified between the parentheses of the COUNT function that we
are counting EMPNOs, because we are indeed counting employees. We could
in fact have merely specified an asterisk, “*” in the parentheses of the COUNT
function, and the system would have worked out that we were counting instances
of records in the EMP table, which equates to counting employees. However, it
is more efficient to specify to the system what is being counted.
GROUP BY, like ORDER BY, can include more than one data item, so for
example if we specify:
GROUP BY DEPTNO, JOB
the results will be returned initially categorised within departments, and then
within that, categorised into employees who do the same job.
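For example, counting the employees doing each job within each department can be sketched as:

SELECT DEPTNO, JOB, COUNT(EMPNO)
FROM EMP
GROUP BY DEPTNO, JOB
ORDER BY DEPTNO, JOB;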
GROUP BY example
Find the average salary for each JOB in the company:
SELECT JOB, ROUND(AVG(SAL),2)
FROM EMP
GROUP BY JOB;
A very common error with GROUP BY
All column names in the select-list must appear in the GROUP BY clause,
unless the name is used only in an aggregate function. Many people, when first
starting to use the GROUP BY clause, fall into the trap of asking the system
to retrieve and display information at two different levels. This arises if, for
example, you GROUP BY a data item such as JOB, but then on the select-list,
include a data item at the individual employee level, such as HIREDATE or
SAL. You might think that we displayed salaries in the previous example, where
we listed the average salaries earned by employees doing the same job. It makes
all the difference, however, that these are average salaries, the averages being
calculated for each category that we are grouping by, in this case the average
salary for each job. It is fine to display average salaries, as these are averaged
across the group, and are therefore at the group level. However, if we had asked
to display individual salaries, we would have had the error message “not a group
by expression”, referring to the fact that SAL is an individual attribute of an
employee, and not in itself an item at the group level. Whenever you see the
“not a group by expression” message, the first thing to check is the possibility
that you have included a request on your select-list to view information at the
individual record level, rather than at the group level. The one individual level
item you can of course include on the select-list, is the item which is shared by
a number of individuals that you are in fact grouping by. So when we asked to
retrieve the average salaries for each JOB, it was of course fine to include the
JOB column in the select-list, because for that query, JOB is an item at the
group level, i.e. we were grouping by JOB.
The HAVING clause
The HAVING clause is used to filter out specific groups or categories of infor-
mation, exactly in the same way that the WHERE clause is used to filter out
individual rows. The HAVING clause always follows a GROUP BY clause, and
is used to test some property or properties of the grouped information.
For example, if we are grouping information at the department level, we might
use a HAVING clause to exclude departments with fewer than a certain
number of employees. This could be coded as follows:
SELECT DEPTNO,COUNT(EMPNO)
FROM EMP
GROUP BY DEPTNO
HAVING COUNT(EMPNO) > 4;
Comments: Department number 10 has four employees in our sample data set,
and has been excluded from the results through the use of the HAVING clause.
The properties that are tested in a HAVING clause must be properties of groups,
i.e. one must either test against individual values of the grouped-by item, such
as:
HAVING JOB = ‘SALESMAN’
OR
JOB = ‘ANALYST’
or test against some property of the group, i.e. the number of members in the
group (as in the example to exclude departments with fewer than five employees,
or, for instance, tests on aggregate functions of the group - for our data set these
could be properties such as the average or total salaries within the individual
groups).
The HAVING clause, when required, always follows immediately after the
GROUP BY clause to which it refers. It can contain compound conditions,
linked by the boolean operators AND or OR (as above), and parentheses may
be used to nest conditions.
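As a sketch of a compound condition in a HAVING clause (the thresholds here are illustrative, not taken from the examples above):

SELECT JOB, COUNT(EMPNO), ROUND(AVG(SAL),2)
FROM EMP
GROUP BY JOB
HAVING COUNT(EMPNO) > 2
AND AVG(SAL) > 1500;

This retains only jobs performed by more than two employees whose average salary exceeds 1500.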
in a uniform manner has been an important factor in the widespread adoption
of Relational database systems.
A curious feature of performing JOINs, or relating information from different
tables in a logical way as required in the above query, is that although the process
is universally referred to as performing a JOIN, the way it is expressed in SQL
does not always involve the use of the word JOIN. This can be particularly
confusing for newcomers to JOINs. For example, to satisfy the query above, we
would code the WHERE clause as follows:
WHERE EMP.DEPTNO = DEPT.DEPTNO
What this is expressing is that we wish rows in the EMP table to be related to
rows in the DEPT table, by matching rows from the two tables whose depart-
ment numbers (DEPTNOs) are equal. So we are using the DEPTNO column
from each employee record in the EMP table, to link that employee record with
the department record for that employee in the DEPT table.
The full query would therefore be:
SELECT EMPNO,ENAME,DNAME
FROM EMP,DEPT
WHERE EMP.DEPTNO = DEPT.DEPTNO;
This gives the following results for our test data set:
A few further points should be noted about the expression of the above query:
• Because we wish to display values of the DNAME attribute in the result,
it has, of course, to be included in the select-list.
• We need not include any mention of the DEPTNO attribute in the select-
list. We require the EMP.DEPTNO and DEPT.DEPTNO columns to
perform the JOIN, so we refer to these columns in the WHERE clause,
but we do not wish to display any DEPTNO information, therefore it is
not included in the select-list.
• As mentioned above, the order in which the EMP and DEPT tables appear
after the FROM keyword is unimportant, at least assuming we can ignore
issues of performance response, which we certainly can for tables of this
size.
• Similarly, the order in which the columns involved in the JOIN operation
are expressed in the WHERE clause is also unimportant.
Example on joining two tables
List the names and jobs of employees, together with the locations in which they
work:
SELECT ENAME,JOB,LOC
FROM EMP,DEPT
WHERE EMP.DEPTNO = DEPT.DEPTNO;
Comments: The exercise requires a simple modification to our first JOIN exam-
ple – replacing EMPNO and DNAME in the select-list with the JOB and LOC
attributes. LOC, like DNAME, is stored in the DEPT table, and so requires
the coordination provided by a JOIN, in order to display employee information
along with the locations of the departments in which those employees work.
The SQL standard provides the following alternative ways to specify this join:
FROM EMP JOIN DEPT ON EMP.DEPTNO = DEPT.DEPTNO;
FROM EMP JOIN DEPT USING (DEPTNO);
FROM EMP NATURAL JOIN DEPT;
In each case, the FROM clause replaces the original FROM and WHERE
clauses.
Avoiding ambiguously named columns
DEPTNO has been used as the data item to link records in the EMP and DEPT
tables in the above examples. For our EMP and DEPT data set, DEPTNO is
in fact the only semantically sensible possibility for use as the JOIN column. In
the DEPT table, DEPTNO acts as the primary key (and as such must have a
different value in every row within the DEPT table), while in the EMP table,
DEPTNO acts as a foreign key, linking each EMP record with the department
record in the DEPT table to which the employee record belongs. If we wish to
refer to DEPTNO in the select-list, we would need to be careful to specify which
instance of DEPTNO we are referring to: the one in the EMP table, or the one
in the DEPT table. Failure to do this will lead to an error message indicating
that the system is unable to identify which column we are referencing. The way
to be specific about which instance of DEPTNO we require is simply to prefix
the reference to the DEPTNO column with the table name containing that
DEPTNO instance, placing a full stop (.) character between the table name
and column name: for example, EMP.DEPTNO, or DEPT.DEPTNO. In this
way, the system can identify which instance of DEPTNO is being referenced.
In general, if there is any possible ambiguity about which column is being ref-
erenced in a query, because a column with that name appears in more than
one table, we use the table prefixing approach to clarify the reference. Note
that this was not necessary when referencing any of the columns in the example
JOIN queries above, as all of these appeared only once within either the EMP
table or the DEPT table.
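For example, to include the department number itself in the result of the earlier JOIN, a prefix is required; either table's DEPTNO could be chosen, since the JOIN condition guarantees the two values are equal in every matched row:

SELECT EMPNO,ENAME,DEPT.DEPTNO,DNAME
FROM EMP,DEPT
WHERE EMP.DEPTNO = DEPT.DEPTNO;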
Outer JOINs
In addition to the basic form of the JOIN, also called a NATURAL JOIN and
used to relate rows in different tables, we sometimes require a little more syn-
tax than we have seen so far, in order to obtain all the information we require.
Supposing, for example, we wish to list all departments with the employee num-
bers and names of their employees, plus any departments that do not contain
employees.
As a first attempt, we might code:
SELECT DEPT.DEPTNO,DNAME,EMPNO,ENAME
FROM EMP,DEPT
WHERE EMP.DEPTNO = DEPT.DEPTNO
ORDER BY DEPT.DEPTNO;
we use a construct called an OUTER JOIN. OUTER JOINs are used precisely
in situations where we wish to force into our results set, rows that do and do not
match a usual JOIN condition. There are three types of OUTER JOIN: LEFT,
RIGHT, and FULL OUTER JOINS. To demonstrate the OUTER JOINs, we
will use the following tables.
Person table
The person table holds the information of people. The ID is the primary key.
A person can own a car or not.
Car table
The car table holds information of cars. The REG is the primary key. A car
can have an owner or not.
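LEFT OUTER JOIN
The syntax of the LEFT OUTER JOIN involves including the LEFT JOIN keyword in the query: every row from the left-hand table is retained, with nulls filling the right-hand columns where there is no match. An example would be: list all persons together with the registration and model of any car they own, including persons who own no car. A sketch, assuming the column names used in the RIGHT JOIN example that follows:

SELECT ID,NAME,REG,MODEL
FROM Person LEFT JOIN Car ON Person.ID = Car.OWNER;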
Comment: The query returns all the persons with cars, plus the one instance
of a person (ID 704555) having no car.
RIGHT OUTER JOIN
Like a LEFT JOIN, the syntax of the RIGHT OUTER JOIN involves including
the RIGHT JOIN keyword in the query. An example would be: List all cars
together with their owner’s identification and name, including any car not owned
by anyone.
SELECT REG,MODEL,ID,NAME
FROM Person RIGHT JOIN Car ON Person.ID = Car.OWNER;
Comment: The query returns all the cars that are owned, plus the one instance
of a car not owned by anyone.
Using table aliases
Table aliasing involves specifying aliases, or alternative names, that can be used
to refer to the table during the processing of a query. The table aliases are
specified in the table-list, following the FROM keyword. For example, the above
FULL OUTER JOIN query can be written using aliases:
SELECT REG,MODEL,ID,NAME
FROM Person p FULL JOIN car c ON p.ID = c.OWNER;
SELF JOINS
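The query to which the comments below refer is not shown here. A sketch of the kind of SELF JOIN being described, reconstructed from those comments (the >= comparison lets JONES' own row appear in the result as a check), might be:

SELECT X.EMPNO, X.ENAME, X.SAL
FROM EMP X, EMP Y
WHERE X.SAL >= Y.SAL
AND Y.ENAME = 'JONES';

Here the EMP table is joined to itself through the two aliases X and Y.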
Comments: Note the use of the aliases for each of the column specifications
in the select-list. We ensure that the alias Y is associated with the employee
JONES through the second condition in the WHERE clause, “AND Y.ENAME
= ‘JONES’ ”. The first condition in the WHERE clause, comparing salaries,
ensures that apart from JONES’ record, which is listed in the result as a check
on the query results, only the details of employees who are paid more than
JONES are retrieved.
Summary of JOINs
We have seen three forms of the JOIN condition. The basic JOIN, also called
a NATURAL JOIN, is used to combine or coordinate the results of a query
in a logical way across more than one table. In our examples, we have seen
that JOINing two tables together involves one JOIN condition and, in general,
JOINing N tables together requires the specification of N-1 JOIN conditions.
A lot of work has gone into the development of efficient algorithms for the
execution of JOINs in all the major database systems, with the result being
that the overall performance of Relational database systems has seen a very
considerable improvement since their introduction in the early ’80s. In spite of
this, JOINs are still an expensive operation in terms of query processing, and
there can be situations where we seek ways of reducing the number of JOINs
required to perform specific transactions.
Two further variants we have seen on the basic JOIN operation are the OUTER
JOIN and the SELF JOIN. The OUTER JOIN is used to force non-matching
records from one side of a JOIN into the set of retrieved results. The SELF
JOIN is used where it is required to compare rows in a table with other rows
from the same table. This comparison is facilitated through the use of aliases,
alternative names which are associated with the table, and so can be used to
reference the table on different sides of a JOIN specification.
Nested queries
The power of the SQL language is increased considerably through the ability to
include one query within another. This is known as nesting queries, or writing
sub-queries.
Nested query example
Find all employees who are paid more than JONES:
This might be considered a two-stage task:
1. Find Jones’ salary.
2. Find all those employees who are paid more than the salary found in step
1.
We might code step 1 as follows:
SELECT SAL
FROM EMP
WHERE ENAME = ‘JONES’
The nested query mechanism allows us to enclose this query within another one,
which we might use to perform step 2:
SELECT EMPNO,ENAME,SAL
FROM EMP
WHERE SAL > …..
We simply need the syntax to enclose the query for step 1 in such a way
that it provides its result to the query which implements step 2.
This is done by enclosing the query for step 1 in parentheses, and linking it to
the query for step 2 as follows:
SELECT EMPNO,ENAME,SAL
FROM EMP
WHERE SAL >
(SELECT SAL FROM EMP WHERE ENAME = ‘JONES’);
This gives the following results:
These are indeed the employees who earn more than JONES (who earns 2975).
Whenever a query appears to fall into a succession of natural steps such as the
one above, it is a likely candidate to be coded as a nested query.
An important point has to be kept in mind when testing for equality of values
across inner and outer queries.
If the inner query returns just one value, then we can use the equal sign, e.g.
SELECT …. FROM ….
WHERE ATTRIBUTE 1 = (SELECT ….
FROM …..)
If, however, the inner query might return more than one row, we must use the
keyword IN, so that we can check whether the value of the attribute being tested
in the WHERE clause of the outer query is IN the set of values returned by the
inner query. Sub-queries can also be linked to a HAVING clause, i.e. they
can retrieve a result which forms part of the condition in the evaluation of a
HAVING clause. In this situation the format of the HAVING clause is:
HAVING ….
(SELECT … FROM .. WHERE ….)
The inner query may of course itself have inner queries, with WHERE, GROUP
BY and HAVING clauses.
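As a sketch of a sub-query linked to a HAVING clause, the following lists the departments whose average salary exceeds the company-wide average:

SELECT DEPTNO, ROUND(AVG(SAL),2)
FROM EMP
GROUP BY DEPTNO
HAVING AVG(SAL) >
(SELECT AVG(SAL) FROM EMP);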
The depth and breadth of nested queries
The number of queries that can be nested varies from one database system to
another, but there is support for this SQL construct in all the leading databases
such that there is no practical limit to the number of queries that can be nested.
In a similar way, a number of queries can be included at the same level of
nesting, their results being combined together using the AND or OR keywords,
according to the following syntax:
SELECT …. FROM ….
WHERE CONDITION 1 (SELECT ….
FROM ….. WHERE …..)
AND/OR (SELECT ….. FROM …. WHERE ….)
AND/OR
……
The UNION operator
To find the details of any employees receiving the same salaries as either SCOTT
or WARD, we could code:
SELECT EMPNO,ENAME,SAL
FROM EMP
WHERE SAL IN
(SELECT SAL FROM EMP
WHERE ENAME = ‘WARD’
OR
ENAME = ‘SCOTT’);
But suppose SCOTT and WARD are in different tables. If this is the case, we
need to use the UNION operator in order to combine the results of queries on
two different tables as follows:
SELECT EMPNO,ENAME,SAL
FROM EMP
WHERE SAL IN
(SELECT SAL
FROM EMP1
WHERE ENAME = ‘WARD’
UNION
SELECT SAL
FROM EMP2
WHERE ENAME = ‘SCOTT’);
Comments: We are assuming here that WARD is in a table called EMP1, and
SCOTT in EMP2. The two salary values retrieved from these sub-queries are
combined into a single results set, which is retrieved for comparison with all
salary values in the EMP table in the outer query. Because there is more than
one salary returned from the combined inner query, the IN keyword is used to
make the comparison. Note that as with the Relational Algebra equivalent, the
results of the SQL UNION operator must be union compatible, as we see they
are in this case, as they both return single salary columns.
The INTERSECT operator
Again, like its Relational Algebra equivalent, the SQL INTERSECT operator
can be used to extract the rows in common between two sets of query results:
SELECT JOB
FROM EMP
WHERE SAL > 2000 INTERSECT
SELECT JOB
FROM SHOPFLOORDETAILS;
Here the INTERSECT operator is used to find all jobs in common between
the two queries. The first query returns all jobs that are paid more than 2000,
whereas the second returns all jobs from a separate table called SHOPFLO-
ORDETAILS. The final result, therefore, will be a list of all jobs in the
SHOPFLOORDETAILS table that are paid more than 2000. Again, note that
the sets of results compared with one another using the INTERSECT operator
must be union compatible.
The MINUS operator
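Like its Relational Algebra equivalent (the difference operator), the SQL MINUS operator returns the rows of the first set of query results that do not appear in the second; as with UNION and INTERSECT, the two sets of results must be union compatible. (MINUS is the Oracle name for this operator; the SQL standard calls it EXCEPT.) A sketch, reusing the tables from the INTERSECT example above:

SELECT JOB
FROM EMP
WHERE SAL > 2000
MINUS
SELECT JOB
FROM SHOPFLOORDETAILS;

This would list the jobs paid more than 2000 that do not appear in the SHOPFLOORDETAILS table.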
ANY or ALL operator
The ANY or ALL operators may be used with sub-queries that return more than
one row. They are used in the WHERE or HAVING clause in conjunction with
the comparison operators (=, !=, >, >=, <=, <). ANY compares a value to each
value returned by a sub-query.
To display employees who earn more than the lowest salary in Department 30,
enter:
SELECT ENAME, SAL, JOB, DEPTNO
FROM EMP
WHERE SAL > ANY
(SELECT DISTINCT SAL
FROM EMP
WHERE DEPTNO = 30)
ORDER BY SAL DESC;
Comments: Note the use of the > sign in conjunction with the ANY operator:
the comparison is carried out repeatedly during query execution, once against
each value returned by the sub-query. “= ANY” is equivalent to the keyword IN.
With ANY, the DISTINCT keyword is often used in the sub-query to avoid the
same values being compared several times. Clearly the lowest salary in
department 30 is below 1100.
ALL compares a value to every value returned by a sub-query.
The following query finds employees who earn more than every employee in
Department 30:
SELECT ENAME, SAL, JOB, DEPTNO
FROM EMP
WHERE SAL > ALL
(SELECT DISTINCT SAL
FROM EMP
WHERE DEPTNO = 30)
ORDER BY SAL DESC;
Comments: The inner query retrieves the salaries for Department 30. The
outer query, using the ALL keyword, ensures that the salaries retrieved are higher
than all of those in department 30. Clearly the highest salary in department 30
is below 2975.
Note that the NOT operator can be used with IN, ANY or ALL.
Correlated sub-queries
SELECT EMPNO,ENAME,SAL,DEPTNO
FROM EMP E
WHERE SAL > (SELECT AVG(SAL)
FROM EMP
WHERE DEPTNO = E.DEPTNO)
ORDER BY DEPTNO;
Giving the results:
Interactive queries
A very useful facility is provided to enable users to run the same query again,
entering a different value of a parameter to a WHERE or HAVING clause. This
is done by prefixing the column specification for which different values are to be
supplied with the “&” sign.
Example
Find the number of clerks based in department 10. Find the number of clerks
in other departments by running the same query, in each case entering the value
of the department number interactively.
SELECT COUNT(EMPNO) “NUMBER OF CLERKS”
FROM EMP
WHERE JOB = ‘CLERK’
AND DEPTNO = &DEPTNO;
The user will be asked to enter a value for DEPTNO. The result for entering 10
is:
This syntax provides a limited amount of interactivity with SQL queries, which
can avoid the need to recode in order to vary the values of interactively specified
parameters.
Activities
Activity 1: JOINs
Activity 2: GROUP BY
Activity 3: Nested queries
Review questions
Discussion topic
Additional content
Chapter 5. Advanced SQL
Table of contents
• Objectives
• Introduction
• Context
• Creating tables in SQL
– Data types
– Defining primary keys
– Defining foreign keys
– Copying data by combining CREATE TABLE and SELECT
– Copying table structures without data
• The ALTER TABLE statement
– Using ALTER TABLE to add columns
– Modifying columns with ALTER TABLE
• Removing tables using the DROP TABLE statement
– Using DROP TABLE when creating tables
• Adding new rows to tables with INSERT
• Changing column values with UPDATE
• Removing rows with DELETE
• Creating views in SQL
– Views and updates
• Renaming tables
• Creating and deleting a database
• Using SQL scripts
• Activities
– Activity 1: Data definition language
– Activity 2: Manipulating rows in tables
– Activity 3: Creating and removing views
• Review questions
• Discussion topic
• Additional content and activities
Objectives
Introduction
In parallel with this chapter, you should read Chapter 6 of Thomas Connolly
and Carolyn Begg, “Database Systems A Practical Approach to Design, Imple-
mentation, and Management”, (5th edn.).
This chapter introduces further features of the SQL language, and seeks to
integrate the material of all three chapters which have provided coverage of
SQL in this module. The chapter introduces the means by which tables are
created, changed and removed in SQL. The statements for inserting, updating
and deleting rows from tables are also covered. Views are an important feature
in SQL for tailoring the presentation of data, and acting as a security mechanism.
The statements for creating and using views will be described, along with some
of the inherent limitations of the view mechanism.
Context
This chapter is the final one specifically dedicated to the SQL language, and
so it forms an important role in drawing together the information covered in
all three of the SQL-related chapters of the module. SQL continues to be an
important vehicle for explaining and illustrating concepts in many of the later
chapters of the module, and provides a medium through which many relevant
practical exercises can be performed.
Although this chapter is called Advanced SQL, the material covered is not in
general more difficult than that of previous chapters. The previous two chapters
on SQL have provided a fairly comprehensive coverage of the data manipula-
tion (DML) part of the language, enabling the specification of a wide range
of queries. This chapter introduces the mechanisms for creating, changing and
removing tables, and for inserting, updating and removing rows from tables.
The mechanisms for performing these actions in SQL are relatively straightfor-
ward, but are extremely powerful. Because SQL is a command-level language,
these commands do not include the checks that we have grown to expect from
a typical Graphical User Interface (GUI), and so they must be used with care.
Data definition language (DDL) statements are used for creating, modifying and
removing data objects. They affect both physical and logical data structures.
Their syntax is generally much more varied than the data manipulation language
(DML) statements we have covered in the previous two chapters.
The CREATE statement in SQL can be used to bring into being a range of
different data objects, including the following:
• Data tables
• Views on existing tables
• Indexes (data structures which speed up access to data)
• Database user accounts
In this section we shall concentrate on the use of the CREATE statement for
establishing new tables. The CREATE TABLE statement has a range of differ-
ent options, and so we shall start by showing a simplified version of the syntax
as follows:
CREATE TABLE "TABLE NAME" (COLUMN SPECIFICATION 1, ...... COLUMN SPECIFICATION n);
where each column specification includes:
• A column name
• The data type of the column
• Where appropriate, a specification of the length of the column
• An optional indicator of whether or not the column is to contain null
values
Data objects are subject to a number of restrictions, and these will vary be-
tween different database systems. We shall describe the restrictions on naming
tables and columns in Oracle, as they are fairly typical limitations encountered
in databases generally, the main exceptions being older PC-based database en-
vironments.
• Table names must start with an alphabetic character, and can contain up
to 30 characters.
• Table names can contain the letters A-Z, the numbers 0-9, and the characters $, # and _.
• Table names must be unique within any specific user account.
• Column names must start with an alphabetic character, and may comprise up to 30 characters.
• They can contain the same characters as table names.
• Column names must be unique within a table, but you can specify the
same column names in different tables.
• In Oracle there can be up to 254 columns in a table.
Referring to the simplified version of the CREATE TABLE statement above,
the column specifications are contained in parentheses.
Data types
We shall focus on three specific data types for use in our practical work, as their
equivalents (though they may be differently named) can be found in almost any
database environment. These data types appear in Oracle as the following:
1. VARCHAR2: is used to store variable-length character strings. In Ora-
cle the strings can store up to 2000 characters. The syntax for specifying
a data item of type VARCHAR2 is:
VARCHAR2 (length)
where length is the maximum length of the character string to be stored.
Note: Some DBMSs, including MySQL, use VARCHAR instead.
2. NUMBER: is used to store general numbers. The NUMBER data type
offers the greatest flexibility for storing numeric data. It accepts positive
and negative integers and real numbers, and has from 1 to 38 digits of
precision. The syntax for specifying the NUMBER data type is:
NUMBER (precision, scale)
where precision is the maximum number of digits to be stored, and scale indicates the number of digits to the right of the decimal point. If scale is omitted, then integer (whole) numbers are stored.
Note: Some DBMSs, including MySQL, expect you to use exact data types for
numeric data. For example, if you want to hold integers, then you must use
the INT datatype. If you wish to hold decimal numbers, then you must use the
DOUBLE datatype.
3. DATE: is used to specify that an attribute is of type ‘date’. The format in which dates are represented within attributes of type date is dd-mon-yyyy; for example, 10-jan-2000. To specify that an attribute is of type date, simply place the word DATE after the name of the attribute in the CREATE TABLE statement.
Like many other systems, Oracle contains a number of other data types in
addition to the most commonly found ones just described. As an example of
other data types that may be provided, here are a few of the rather more Oracle-
specific data types available:
• Decimal: is used to store fixed-point numbers, and provides compatibility
with IBM’s DB2 and SQL/DS systems.
• Float: is used to store floating-point numbers, and provides compatibility
with the ANSI float datatype.
• Char: is used to store fixed-length character strings. It is most commonly
used for representing attributes of one character in length.
• Long: is used to store up to two gigabytes of character data. However, none of Oracle’s built-in functions and operators can be used to search an attribute of type ‘long’.
The final part of a column specification allows us to specify whether or not the column may contain null values, i.e. whether or not it is a mandatory column. The syntax is simply to specify either NOT NULL or NULL after the data type (and possible length) specification.
Example CREATE TABLE statement
Suppose we wished to create a table called MUSIC_COLLECTION, which we
would use to store the details of CDs, cassettes, minidiscs, etc. This could be
done with the following statement:
CREATE TABLE MUSIC_COLLECTION (ITEM_ID NUMBER(4),
TITLE VARCHAR2(40),
ARTIST VARCHAR2(30),
ITEM_TYPE VARCHAR2(1),
DATE_PURCHASED DATE);
We use a unique numeric identifier called ITEM_ID to identify each of the items
in the collection, as we cannot rely on either the TITLE or ARTIST attributes
to identify items uniquely. The ITEM_TYPE attribute is used to identify which
format the item is in, i.e. cassette, CD, etc.
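The statement can be tried directly in a lightweight engine. The sketch below uses Python's built-in sqlite3 module; SQLite happens to accept the Oracle type names shown (it maps unfamiliar names onto its own loose typing rules), and its PRAGMA table_info plays the role that Oracle's DESCRIBE command plays:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE MUSIC_COLLECTION (
        ITEM_ID        NUMBER(4),
        TITLE          VARCHAR2(40),
        ARTIST         VARCHAR2(30),
        ITEM_TYPE      VARCHAR2(1),
        DATE_PURCHASED DATE)""")

# Read the column names back from the catalogue, much as DESCRIBE would.
cols = [row[1] for row in conn.execute("PRAGMA table_info(MUSIC_COLLECTION)")]
print(cols)
```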
Defining primary keys
A primary key is specified by adding a PRIMARY KEY clause to the CREATE TABLE statement, listing the column or columns that make up the key. Specifying a primary key is not mandatory, but it is desirable for maintaining the integrity of the database: the system will not allow a null value to be entered for a primary key. In Oracle there is an upper limit of 16 columns that can be included within a primary key.
Example of creating table with a primary key
If we wanted to specify that ITEM_ID is to be used as the primary key in
the MUSIC_COLLECTION table, we could code the following version of our
CREATE TABLE statement:
CREATE TABLE MUSIC_COLLECTION (ITEM_ID NUMBER(4),
TITLE VARCHAR2(40),
ARTIST VARCHAR2(30), ITEM_TYPE VARCHAR2(1),
DATE_PURCHASED DATE, PRIMARY KEY (ITEM_ID));
Defining foreign keys
A foreign key is used to form the link between rows stored in one table and
corresponding rows in another table. For example, in the sample data set we
have used in the previous two chapters on SQL, the foreign key EMP.DEPTNO
in the employee table was used to link employees to their corresponding depart-
ments in the DEPT table. The CREATE TABLE statement allows us to specify
one or more foreign keys in a table as follows:
CREATE TABLE "TABLE NAME"
(COLUMN SPECIFICATION 1,
COLUMN SPECIFICATION n,
PRIMARY KEY (columnA, ...., columnX),
CONSTRAINT "constraint name"
FOREIGN KEY (columnAA, ...., columnXX)
REFERENCES "primary key specification"
……………);
As with primary keys, foreign key specifications are not mandatory in a CREATE TABLE statement, but specifying them is desirable for maintaining the integrity of the database. Oracle will ensure that any value entered for a foreign key either equals a value of the corresponding primary key, or is null. A CREATE TABLE statement can contain both a primary key specification and a number of foreign key specifications.
An explanation of the foreign key specification is as follows:
• The first item is the keyword “CONSTRAINT”, followed by an optional
constraint name. Although specifying a constraint name is optional, it
is recommended that you always include it. The reason for this is that
if no constraint name is specified, most systems, including Oracle, will
allocate one, and this will not be in any way easy to remember. If you
later wish to refer to the foreign key constraint - for instance, because you
wish to remove it - then providing your own name at the point you enter
the CREATE TABLE statement will make this much easier.
• The words FOREIGN KEY are followed by a list of the columns to be
included in the foreign key, contained in parentheses and separated by
commas (this list of column names is in general different from the list of
column names in the “Primary key” clause).
• REFERENCES is the mandatory keyword, indicating that the foreign key
will refer to a primary key.
• The primary key specification starts with the name of the table containing
the referenced primary key, and then lists the columns comprising the
primary key, contained in parentheses and separated by commas as usual.
The full stops (……………) shown in the version of the syntax above indicate that
there may be more than one foreign key specification.
Example of defining a foreign key in SQL
Supposing we have a second table, which we use to keep track of recording
artists whose recordings we buy. The ARTIST table could be created with the
following statement:
CREATE TABLE ARTIST (
ARTIST_ID NUMBER(2),
ARTIST_NAME VARCHAR2(30),
COUNTRY_OF_ORIGIN VARCHAR2(25),
DATE_OF_BIRTH DATE, PRIMARY KEY (ARTIST_ID));
To relate the MUSIC_COLLECTION table to the ARTIST table, we could
make the following modifications to the CREATE TABLE statement for the
MUSIC_COLLECTION table, which replaces the ARTIST_NAME with a foreign key reference to the ARTIST_ID:
CREATE TABLE MUSIC_COLLECTION (ITEM_ID NUMBER(4),
TITLE VARCHAR2(40),
ARTIST_ID NUMBER(2),
ITEM_TYPE VARCHAR2(1),
DATE_PURCHASED DATE, PRIMARY KEY (ITEM_ID),
CONSTRAINT FK_ARTIST
FOREIGN KEY (ARTIST_ID) REFERENCES ARTIST (ARTIST_ID));
The following points should be noted:
1. We have modified the third attribute: instead of containing the artist name (of type VARCHAR2 and length 30), it now contains the ARTIST_ID, of type NUMBER and length 2.
2. We have then used the ARTIST_ID as the foreign key, which references
the primary key of the table ARTIST.
Note that in general, it would not be possible to make changes such as these
to existing tables. It would require some existing tables and data to be deleted.
Therefore it is good practice to consider very carefully the design of tables and
specification of the primary and foreign keys that are going to be required,
and to specify this correctly the first time in CREATE TABLE statements.
It is however possible to add and remove both primary key and foreign key
constraints, and this will be covered, along with the details of a range of other
constraints mechanisms, in the chapter Declarative Constraints and Database
Triggers.
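The effect of a foreign key constraint can be demonstrated in miniature. This sketch uses Python's sqlite3 module; unlike Oracle, SQLite enforces foreign keys only after PRAGMA foreign_keys = ON, and the sample rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request

conn.execute("""CREATE TABLE ARTIST (
    ARTIST_ID   INTEGER PRIMARY KEY,
    ARTIST_NAME VARCHAR2(30))""")
conn.execute("""CREATE TABLE MUSIC_COLLECTION (
    ITEM_ID   INTEGER PRIMARY KEY,
    TITLE     VARCHAR2(40),
    ARTIST_ID INTEGER,
    CONSTRAINT FK_ARTIST FOREIGN KEY (ARTIST_ID)
        REFERENCES ARTIST (ARTIST_ID))""")

conn.execute("INSERT INTO ARTIST VALUES (1, 'HOLLY')")
# A matching foreign key value, and a null one, are both accepted
conn.execute("INSERT INTO MUSIC_COLLECTION VALUES (100, 'GREATEST HITS', 1)")
conn.execute("INSERT INTO MUSIC_COLLECTION VALUES (101, 'ANONYMOUS', NULL)")

rejected = False
try:
    # 99 matches no ARTIST_ID in the ARTIST table, so this insert must fail
    conn.execute("INSERT INTO MUSIC_COLLECTION VALUES (102, 'BAD ROW', 99)")
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```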
Copying data by combining CREATE TABLE and SELECT
An extremely useful variant of the CREATE TABLE statement exists for copying data. Essentially, this consists of using a SELECT statement to provide the
column specifications for the table to be created and, in addition, the data that
is retrieved by the SELECT statement is copied into the new table structure.
The syntax for this form of the statement is as follows:
CREATE TABLE “TABLE NAME”
AS “select statement”;
where “select statement” can be any valid SQL query.
This form of the CREATE TABLE statement can be used to, for example:
• Copy entire tables.
• Copy subsets of tables using the select-list and WHERE clause to filter
rows.
• Create tables which combine data from more than one table (using JOINs).
• Create tables containing aggregated data (using GROUP BY).
Examples of copying data using CREATE TABLE ... SELECT
Example 1:
To create a copy of the EMP table we have used in previous exercises:
CREATE TABLE EMPCOPY
AS SELECT * FROM EMP;
Example 2:
To create a table containing a list of employees and their locations we can code:
CREATE TABLE EMPLOC
AS SELECT EMPNO, EMP.DEPTNO, ENAME, LOC
FROM EMP, DEPT
WHERE EMP.DEPTNO = DEPT.DEPTNO;
To examine the contents of the new table:
SELECT *
FROM EMPLOC;
Copying table structures without data
Sometimes you may wish to copy the structure of a table without moving any
of the data from the old table into the new one. For example, to take a copy of
the structure of the EMP table, but without copying any employee records into
the new table, we could use:
CREATE TABLE EMPSTRUCT
AS SELECT *
FROM EMP
WHERE 1 = 2;
To verify we have copied the structure:
DESCRIBE EMPSTRUCT
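Both forms can be sketched together. SQLite (via Python's sqlite3 module) supports the same CREATE TABLE ... AS SELECT syntax, though it has no DESCRIBE command, so we count rows instead; the EMP data is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMP (EMPNO INTEGER, ENAME TEXT, DEPTNO INTEGER)")
conn.executemany("INSERT INTO EMP VALUES (?, ?, ?)",
                 [(7369, 'SMITH', 20), (7499, 'ALLEN', 30)])

# Full copy: structure and data come across together
conn.execute("CREATE TABLE EMPCOPY AS SELECT * FROM EMP")
full_count = conn.execute("SELECT COUNT(*) FROM EMPCOPY").fetchone()[0]

# Structure only: the impossible condition 1 = 2 selects no rows at all
conn.execute("CREATE TABLE EMPSTRUCT AS SELECT * FROM EMP WHERE 1 = 2")
empty_count = conn.execute("SELECT COUNT(*) FROM EMPSTRUCT").fetchone()[0]

print(full_count, empty_count)
```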
The ALTER TABLE statement
The ALTER statement in SQL, like the CREATE statement, can be used to
change a number of different types of data objects, including tables, access
privileges and constraints. Here we shall concentrate on its use to change the
structure of tables.
You can use the ALTER TABLE statement to modify a table’s definition. This
statement changes the structure of a table, not its contents. You can use the
ALTER TABLE statement to:
• Add a new column to an existing table.
• Increase or decrease the width of an existing column.
• Change an existing column from mandatory to optional (i.e. specify that
it may contain nulls).
Columns can be added to existing tables with this form of the ALTER TABLE
statement. The syntax is:
ALTER TABLE “TABLE NAME”
ADD “COLUMN SPECIFICATION 1”,
…………,
“COLUMN SPECIFICATION n”;
For example, to add a department-head attribute to the DEPT table, we could
specify:
ALTER TABLE DEPT
ADD DEPT_HEAD NUMBER(4);
We could imagine that the new DEPT_HEAD column would contain EMPNO
values, corresponding to the employees who were the department heads of par-
ticular departments. Incidentally, if we had wished to make the DEPT_HEAD
field mandatory, we could not have done so, as the ALTER TABLE statement
does not enable the addition of mandatory fields to tables that already contain
data.
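The ADD form can be tried in SQLite through Python's sqlite3 module; SQLite accepts the same syntax, though it allows only one added column per statement, and the DEPT row here is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DEPT (DEPTNO INTEGER, DNAME TEXT, LOC TEXT)")
conn.execute("INSERT INTO DEPT VALUES (10, 'ACCOUNTING', 'NEW YORK')")

conn.execute("ALTER TABLE DEPT ADD DEPT_HEAD NUMBER(4)")

# Rows that already existed acquire the new column with a null value
row = conn.execute("SELECT DEPT_HEAD FROM DEPT").fetchone()
print(row)
```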
We can add a number of columns with one ALTER TABLE statement.
Modifying columns with ALTER TABLE
Existing columns can be changed with the MODIFY form of the ALTER TABLE statement. The syntax is:
ALTER TABLE "TABLE NAME"
MODIFY "COLUMN SPECIFICATION 1",
………….,
"COLUMN SPECIFICATION n";
For example, to change our copy of the EMP table, called EMPCOPY, so that
the DEPTNO attribute can contain three digit values:
ALTER TABLE EMPCOPY
MODIFY DEPTNO NUMBER(3);
This form of the ALTER TABLE statement can be used to:
• Increase the length of an existing column.
• Transform a column from mandatory to optional (i.e. specify it can contain
nulls).
There are a number of restrictions on the use of the ALTER TABLE statement for modifying columns, most of which might be guessed through a careful consideration of what is being required of the system. For example, you cannot:
• Reduce the size of an existing column (even if it has no data in it).
• Change a column from being optional to mandatory.
Removing tables using the DROP TABLE statement
Tables are removed with the DROP TABLE statement, which deletes the table structure together with any data it contains. The syntax is:
DROP TABLE "TABLE NAME";
Using DROP TABLE when creating tables
If you wish to make changes to an existing table that the ALTER TABLE statement does not support, it is necessary to drop the table before issuing the new CREATE TABLE statement.
Clearly this should only be done if the data in the table can be lost, or can be
safely copied elsewhere, perhaps through the use of a CREATE TABLE with a
SELECT clause.
For a little further information about the use of the DROP TABLE statement
when creating tables, see the section on using SQL scripts later in this chapter.
Adding new rows to tables with INSERT
The INSERT statement is used to add rows to an existing table. The statement
has two basic forms:
1. To insert a single row into a table:
INSERT INTO “TABLE NAME” (COLUMN-LIST) VALUES
(LIST OF VALUES TO BE INSERTED);
The COLUMN-LIST describes all of the columns into which data is to be in-
serted. If values are to be inserted for every column, i.e. an entire row is to be
added, then the COLUMN-LIST can be omitted.
The LIST OF VALUES TO BE INSERTED comprises the separate values of
the new data items, separated by commas.
Example 1:
To insert a new row into the table DEPTCOPY (this is a copy of the DEPT
table):
INSERT INTO DEPTCOPY VALUES (50, 'PURCHASING', 'SAN FRANCISCO');
1 row created.
Example 2:
To insert a new department for which we do not yet know the location:
INSERT INTO DEPTCOPY (DEPTNO,DNAME)
VALUES (60, 'PRODUCTION');
1 row created.
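The two single-row examples above can be replayed in any SQL engine; here is a sketch using Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DEPTCOPY (DEPTNO INTEGER, DNAME TEXT, LOC TEXT)")

# A complete row: the column-list may be omitted
conn.execute("INSERT INTO DEPTCOPY VALUES (50, 'PURCHASING', 'SAN FRANCISCO')")

# A partial row: the unlisted column (LOC) is set to null
conn.execute("INSERT INTO DEPTCOPY (DEPTNO, DNAME) VALUES (60, 'PRODUCTION')")

rows = conn.execute("SELECT * FROM DEPTCOPY ORDER BY DEPTNO").fetchall()
print(rows)
```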
2. To insert a number of rows using a SELECT statement.
The syntax for this form of the INSERT statement is as follows:
INSERT INTO “TABLE NAME” (COLUMN-LIST)
“SELECT STATEMENT”;
The COLUMN-LIST is optional, and is used to specify which columns are to be
filled when not all the columns in the rows of the target table are to be filled.
The “SELECT STATEMENT” is any valid select statement.
This is, rather like the case of using SELECT with the CREATE TABLE state-
ment, a very powerful way of moving existing data (possibly from separate
tables) into a new table.
Example:
Supposing we have created a table called MANAGER, which is currently empty.
To insert the numbers, names and salaries of all the employees who are managers
into the table we would code:
INSERT INTO MANAGER
SELECT EMPNO, ENAME, SAL
FROM EMP
WHERE JOB = 'MANAGER';
3 rows created.
To verify the employees in the table are managers, we can select the data and
compare the jobs of those employees in the original EMP table:
SELECT *
FROM MANAGER;
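The MANAGER example can be sketched end to end with Python's sqlite3 module; the EMP sample rows are invented, with three managers so that three rows are copied:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMP (EMPNO INTEGER, ENAME TEXT, JOB TEXT, SAL INTEGER)")
conn.executemany("INSERT INTO EMP VALUES (?, ?, ?, ?)", [
    (7566, 'JONES', 'MANAGER', 2975),
    (7698, 'BLAKE', 'MANAGER', 2850),
    (7782, 'CLARK', 'MANAGER', 2450),
    (7369, 'SMITH', 'CLERK',   800)])
conn.execute("CREATE TABLE MANAGER (EMPNO INTEGER, ENAME TEXT, SAL INTEGER)")

# Every row returned by the SELECT is inserted into MANAGER in one statement
conn.execute("""INSERT INTO MANAGER
                SELECT EMPNO, ENAME, SAL FROM EMP WHERE JOB = 'MANAGER'""")

n_managers = conn.execute("SELECT COUNT(*) FROM MANAGER").fetchone()[0]
print(n_managers)
```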
Changing column values with UPDATE
The UPDATE statement is used to change the values of columns in SQL tables.
It is extremely powerful, but like the DROP statement we encountered earlier,
it does not prompt you about whether you really wish to make the changes you
have specified, and so it must be used with care.
The syntax of the UPDATE statement is as follows:
UPDATE "TABLE NAME"
SET "column-list" = expression | sub-query
WHERE "CONDITION";
The SET keyword immediately precedes the column or columns to be updated,
which are specified in the column list. If there is more than one column in the
list, they are separated by commas.
Following the equals sign “=” there are two possibilities for the format of the
value to be assigned. An expression can be used, which may include mathemat-
ical operations on table columns as well as constant values. If an expression is
supplying the update value, then only one column can be updated.
Alternatively, a sub-query or SELECT statement can be used to return the
value or values to which the updated columns will be set. If a sub-query is used
to return the updated values, then the number of columns to be updated must
be the same as the number of columns in the select-list of the sub-query.
Finally, the syntax includes a WHERE clause, which is used to specify which
rows in the target table will be updated. If this WHERE clause is omitted, all
rows in the table will be updated.
Example 1:
To give all the analysts in the copy of the EMP table (called EMPCOPY) a
raise of 10%:
UPDATE EMPCOPY
SET SAL = SAL * 1.1
WHERE JOB = 'ANALYST';
2 rows updated.
Example 2:
Suppose we wish to flatten the management structure for the employees stored
in the EMPCOPY table. Recall that the MGR of each employee contains the
employee number of their manager. We might implement this flattening exercise,
at least as far as the database systems are concerned, by setting all employees’
MGR fields to that of KING, who is the president of the company. The update
statement to do this would be as follows:
UPDATE EMPCOPY
SET MGR =
(SELECT EMPNO
FROM EMP
WHERE ENAME = 'KING')
WHERE ENAME != 'KING';
13 rows updated.
Note that we have been careful to include the final WHERE clause, in this case
to avoid updating KING’s MGR field.
To verify that the updates have taken place correctly:
SELECT EMPNO,ENAME,MGR
FROM EMPCOPY;
Note that all MGR fields, except that of KING, have been set to 7839, which is of course KING’s EMPNO. It is a nice feature of the SQL language that we were able to code this query without knowing KING’s EMPNO value, though we did have to know something unique about KING in order to retrieve the EMPNO value from the table. In this case we used the value of ENAME, but this is in general unsafe, as names need not be unique; it would have been better to use KING’s EMPNO value directly. We could equally have used the value of JOB, provided we could rely on there being only one President in the table.
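The flattening update can be reproduced in miniature with Python's sqlite3 module (four invented employees instead of fourteen; here the sub-query reads the same table being updated, which SQLite permits for a scalar sub-query):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPCOPY (EMPNO INTEGER, ENAME TEXT, MGR INTEGER)")
conn.executemany("INSERT INTO EMPCOPY VALUES (?, ?, ?)", [
    (7839, 'KING',  None),    # the president has no manager
    (7566, 'JONES', 7839),
    (7369, 'SMITH', 7902),
    (7902, 'FORD',  7566)])

# The sub-query retrieves KING's EMPNO; the outer WHERE protects KING's row
conn.execute("""UPDATE EMPCOPY
                SET MGR = (SELECT EMPNO FROM EMPCOPY WHERE ENAME = 'KING')
                WHERE ENAME != 'KING'""")

mgrs = conn.execute(
    "SELECT DISTINCT MGR FROM EMPCOPY WHERE ENAME != 'KING'").fetchall()
print(mgrs)
```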
Removing rows with DELETE
The DELETE statement is the last of the DML statements we shall look at in detail. It is used to remove single rows or groups of rows from a table. Its format is as follows:
DELETE FROM "TABLE NAME"
WHERE "COLUMN-LIST" = | IN CONSTANT | EXPRESSION | SUB-QUERY;
As for the UPDATE statement, if the WHERE clause is omitted, all of the rows
will be removed from the table. However, unlike the DROP TABLE statement,
a DELETE statement leaves the table structure in place.
Example 1: To remove an individual employee from the EMPCOPY
table:
DELETE FROM EMPCOPY
WHERE ENAME = 'FORD';
1 row deleted.
Note that, had there been more than one employee called FORD, all would have
been deleted.
Example 2: To delete a number of rows based on an expression: To
remove all employees paid more than 2800:
DELETE FROM EMPCOPY
WHERE SAL > 2800;
5 rows deleted.
Example 3: Deleting using a sub-query: To remove any employees
based in the SALES department:
DELETE FROM EMPCOPY
WHERE DEPTNO IN
(SELECT DEPTNO FROM DEPT WHERE DNAME = 'SALES');
5 rows deleted.
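The sub-query form of DELETE can be sketched as follows with Python's sqlite3 module (invented sample rows: two SALES employees and one in ACCOUNTING):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DEPT (DEPTNO INTEGER, DNAME TEXT)")
conn.execute("""CREATE TABLE EMPCOPY (
    EMPNO INTEGER, ENAME TEXT, SAL INTEGER, DEPTNO INTEGER)""")
conn.executemany("INSERT INTO DEPT VALUES (?, ?)",
                 [(10, 'ACCOUNTING'), (30, 'SALES')])
conn.executemany("INSERT INTO EMPCOPY VALUES (?, ?, ?, ?)", [
    (7499, 'ALLEN', 1600, 30),
    (7698, 'BLAKE', 2850, 30),
    (7782, 'CLARK', 2450, 10)])

# Remove every employee whose department is named SALES
conn.execute("""DELETE FROM EMPCOPY
                WHERE DEPTNO IN
                (SELECT DEPTNO FROM DEPT WHERE DNAME = 'SALES')""")

remaining = conn.execute("SELECT ENAME FROM EMPCOPY").fetchall()
print(remaining)
```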
Creating views in SQL
Views are an extremely useful mechanism for providing users with a subset of
the underlying data tables. As such, they can provide a security mechanism, or
simply be used to make the user’s job easier by reducing the rows and columns
of irrelevant data to which users are exposed.
Views are the means by which, in SQL databases, individual users are provided
with a logical, tailored schema of the underlying database. Views are in effect
virtual tables, but appear to users in most respects the same as normal base
tables. The difference is that when a view is created, it is not stored like a base
table; its definition is simply used to recreate it for use each time it is required.
In this sense, views are equivalent to stored queries.
Views are created using the CREATE VIEW statement. The syntax of this
statement is very similar to that for creating tables using a SELECT.
Example: To create a view showing the names and hiredates of em-
ployees, based on the EMP table:
CREATE VIEW EMPHIRE
AS SELECT ENAME,HIREDATE
FROM EMP;
View created.
To examine the structure of the view EMPHIRE, we can use the DESCRIBE
command, just as for table objects:
DESCRIBE EMPHIRE
To see the data in the view, we can issue a SELECT statement just as if the
view EMPHIRE is a table:
SELECT *
FROM EMPHIRE;
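That a view is a stored query rather than a copy of the data can be seen by inserting into the base table after the view has been created. A sketch with Python's sqlite3 module (sample rows invented; SQLite has no DESCRIBE command, so we simply query the view):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMP (EMPNO INTEGER, ENAME TEXT, HIREDATE TEXT)")
conn.executemany("INSERT INTO EMP VALUES (?, ?, ?)",
                 [(7369, 'SMITH', '1980-12-17'), (7499, 'ALLEN', '1981-02-20')])

# Only the defining query is stored; no data is copied
conn.execute("CREATE VIEW EMPHIRE AS SELECT ENAME, HIREDATE FROM EMP")
before = conn.execute("SELECT * FROM EMPHIRE").fetchall()

# A later insert into the base table is immediately visible through the view
conn.execute("INSERT INTO EMP VALUES (7521, 'WARD', '1981-02-22')")
after_count = conn.execute("SELECT COUNT(*) FROM EMPHIRE").fetchone()[0]
print(before, after_count)
```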
Views and updates
Renaming tables
Using SQL scripts
SQL statements can be combined into a file and executed as a group. This is
particularly useful when it is required to create a set of tables together, or use a
large number of INSERT statements to enter rows into tables. Files containing
SQL statements in this way are called SQL script files. Each separate SQL
statement in the file must be terminated with a semi-colon to make it run.
Comments can be included in the file with the use of the REM statement, e.g.
REM insert your comment here
The word REM appears at the start of a new line in the script file. REM
statements do not require a semi-colon (;) terminator.
Having created one or more tables, if you then decide you wish to make changes
to them, some of which may be difficult or impossible using the ALTER TABLE
statement, the simplest approach is to drop the tables and issue a new CREATE
TABLE statement which implements the required changes. If the tables whose
structures you wish to change contain any data you wish to retain, you should
first use CREATE TABLE with a sub-query to copy the data to another table,
from which it can be copied back when you have carried out the required table
restructuring.
The restructuring of a number of tables is best implemented by including the
required CREATE TABLE statements in a script file. To avoid errors when this
file is re-run, it is customary to place a series of DROP TABLE statements at
the beginning of the file, one for each table that is to be created. In this way
you can re-run the script file with no problems. The first time it runs, assuming
the tables have not already been created outside the script, the DROP TABLE
statements will raise error messages, but these can be ignored. It is of course
essential, if the tables contain data, to ensure this has been copied to other
tables before such a restructuring exercise is undertaken.
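The drop-then-create pattern can be sketched with Python's sqlite3 module, whose executescript() runs a string of semi-colon-terminated statements just like a script file. One dialect difference: SQLite's DROP TABLE IF EXISTS suppresses the ignorable first-run errors mentioned above (the table and rows are invented):

```python
import sqlite3

# Dropping each table before creating it lets the script be re-run safely
script = """
DROP TABLE IF EXISTS TUTOR;
CREATE TABLE TUTOR (TUTOR_ID INTEGER PRIMARY KEY, TUTOR_NAME TEXT);
INSERT INTO TUTOR VALUES (1, 'JONES');
INSERT INTO TUTOR VALUES (2, 'PATEL');
"""

conn = sqlite3.connect(":memory:")
conn.executescript(script)
conn.executescript(script)  # the second run succeeds because of the DROP

total = conn.execute("SELECT COUNT(*) FROM TUTOR").fetchone()[0]
print(total)
```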
Activities
Activity 1: Data definition language
1. Create the following tables, choosing appropriate data types for the at-
tributes of each table. In your CREATE TABLE statements, create pri-
mary keys for both tables, and an appropriate foreign key for the student
table.
Tables:
TUTOR (TUTOR_ID, TUTOR_NAME, DEPARTMENT, SALARY, ADVICE_TIME)
STUDENT (STUDENT_NO, STUDENT_NAME, DATE_JOINED, COURSE, TUTOR_ID)
Important note: Because of the foreign key constraint, you should create the
TUTOR table first, so that it will be available to be referenced by the foreign
key from the STUDENT table when it is created.
Use DESCRIBE to check the table structure.
2. Perform the following checks and modifications to the tables above. After
each modification, use DESCRIBE to verify the change has been made
correctly.
Add an ANNUAL_FEE column to the STUDENT table.
Ensure that the STUDENT_NO field is sufficiently large to accommodate over
10,000 students. If it is not, change it so that it can deal with this situation.
Add an ANNUAL_LEAVE attribute to the tutor table.
Ensure that the tutor’s salary attribute can handle salaries to a precision of two decimal places.
Remove the ADVICE_TIME attribute from the tutor table.
Activity 2: Manipulating rows in tables
1. Populate the TUTOR and STUDENT tables with appropriate data values.
Ensure that some of the student records you insert have null values in
their foreign key of TUTOR_ID, and that other students have foreign key
values which match the TUTOR_IDs of tutors in the TUTOR table. To
place null values into the TUTOR_ID attribute for a student, you need to put the word NULL in the position where the TUTOR_ID value would appear in the VALUES list of the INSERT statement; for example:
INSERT INTO STUDENT VALUES (1505, 'KHAN', '04-OCT-1999', 'COMPUTING', NULL, 5000);
1 row created.
Note that because inserting data a row at a time with the INSERT statement
is rather slow, it is only necessary to put small samples of tutors and students
into the tables; for example, about four tutor records and eight student records
should be sufficient.
2. Use CREATE TABLE with a sub-query to make copies of your TUTOR
and STUDENT tables before you proceed to the following steps of the
activity, which involve updating and removing data. Having done this, if
you accidentally remove more data than you intended, you can copy it
back from your backup tables by dropping the table, and then using the
CREATE TABLE statement with a sub-query.
3. Update a specific student record in order to change the course he or she
is attending.
4. Update the TUTOR_ID of a specific student in order to change the tutor
for that student. Write the update statement by using the tutor’s name,
in order to retrieve the TUTOR_ID supplying the update value.
5. Remove all students who do not have a tutor.
Review questions
Discussion topic
The three chapters covering SQL have introduced a wide range of mechanisms
for querying and manipulating data in relational tables. This is not the complete
story as far as SQL is concerned, but you have now encountered a major part
of the facilities available within standard implementations of the language. You
are encouraged to discuss with your colleagues your views on the SQL language
that you have been learning. Particular aspects of interest for discussion include:
• What do you feel are the strengths of the language, in terms of learnability,
usability and flexibility?
• On the other hand, which aspects of the language have you found difficult
or awkward either to learn or to use?
• Are there ways in which you feel the language could be improved?
• How does use of the SQL language compare with other database systems
or programming languages you have encountered?
• How feasible is it to use natural language (e.g. English) statements instead
of SQL, to retrieve data in an Oracle database? What are the potential
problems and how might they be overcome?
Chapter 6. Entity-Relationship Modelling
Table of contents
• Objectives
• Introduction
• Context
• Entities, attributes and values
– Entities
– Attributes
– Values
– Primary key data elements
– Key
– Candidate keys
– Foreign keys
• Entity-Relationship Modelling
– Entity representation
– One-to-one relationships between two entities
– One-to-many relationships between two entities
– Many-to-many relationships between two entities
– Recursive relationships
• Relationship participation condition (membership class)
– Mandatory and optional relationships
– One-to-one relationships and participation conditions
∗ Both ends mandatory
∗ One end mandatory, other end optional:
∗ One end optional, other end mandatory:
∗ Both ends optional:
– One-to-many relationships and participation conditions
∗ Both ends mandatory:
∗ One end mandatory, other end optional:
∗ One end optional, other end mandatory:
∗ Both ends optional:
– Many-to-many relationships and participation conditions
∗ Both ends mandatory:
∗ One end mandatory, other end optional:
∗ One end optional, other end mandatory:
∗ Both ends optional
• Weak and strong entities
• Problems with entity-relationship (ER) models
– Fan traps
– Chasm traps
• Converting entity relationships into relations
– Converting one-to-one relationships into relations
∗ Mandatory for both entities
∗ Mandatory for one entity, optional for the other entity
∗ Optional for both entities
– Converting one-to-many relationships into relations
∗ Mandatory for both entities
∗ Mandatory for one entity, optional for another entity: many end
mandatory
∗ Mandatory for one entity, optional for another entity: many end
optional
∗ Optional for both entities
– Converting many-to-many relationships into relations
∗ Mandatory for both entities
∗ Mandatory for one entity, optional for the other entity
∗ Optional for both entities
– Summary of conversion rules
• Review questions
Objectives
Introduction
In parallel with this chapter, you should read Chapter 11 of Thomas Connolly and Carolyn Begg, “Database Systems: A Practical Approach to Design, Implementation, and Management” (5th edn.).
This chapter is the first to address in detail the extremely important topic of
database design. The main approach described in this chapter is called Entity-
Relationship Modelling. This technique has become a widely used approach in
the development of database applications. The approach is essentially top-down,
in that the first step is to look overall at the requirements for the application be-
ing developed, identifying the entities involved. The approach progresses from
that point through the development of a detailed model of the entities, their
attributes and relationships. The Entity-Relationship Modelling process is not
formal, in the mathematical sense, but to be done well, it requires a consistent
precision to be applied to the way that entities, their relationships and their
attributes are discussed. The approach can be supplemented by methods which
are more formal in their approach, and that provide a bottom-up perspective
to the design process. The most commonly used of these approaches is Normal-
isation, which will be a core topic of the later chapters on database design.
Context
This chapter introduces the ideas of top-down database design, and provides the
starting point in learning how to develop a database application. The chapter
links closely with the others covering database design (Normalisation and other
design topics). The chapter also has considerable relevance for the material
in the module on performance tuning, such as the chapter on indexing, as the
decisions made during database design have a major impact on the performance
of the application.
Entities
Locations
• Warehouses
• Stock rooms
Events
• Sale is made
• Purchase order is raised
• Item is hired
• Invoice is issued
Concepts
• Image of product
• Advertising
• Marketing
• Research and development.
Each of these can be regarded as an entity.
Important
Entity
An entity may represent a category of people, things, events, locations or con-
cepts within the area under consideration. An entity instance is a specific exam-
ple of an entity. For example, John Smith is an entity instance of an employee
entity.
Attributes
Entities have attributes. The following are typical of the attributes that an
entity might possess:
Entity: House
Attributes:
Entity: Book
Attributes:
Entity: Employee
Attributes:
Important
Attribute
An entity may have one or more attributes associated with it. These attributes
represent certain characteristics of the entity; for a person, attributes might be
name, age, address, etc.
Values
Using the entities and attributes shown above, the following are examples of
one set of values for a particular instance of each entity. Every occurrence of an
entity will have its own set of values for the attributes it possesses.
Entity: House
Attributes:
Values:
Entity: Book
Attributes:
Values:
Entity: Employee
Attributes:
Values:
If the value of certain attributes (or perhaps just one attribute) is known for
a particular entity, this enables us to discover the value of other attributes
associated with that entity. The attributes (or attribute) which possess this
quality are known as keys, because they are able to ‘unlock’ the values of the
other attributes that are associated with that specific instance of an entity.
Why do we need a key? Suppose we had two members of staff with the same
(or similar) names, such as Linda Clark and Lydia Clark. It would be a simple
mistake to record something in the file of Linda Clark that should be kept in the
file for Lydia Clark (or the other way around). It would be even more difficult
to tell them apart if the name was given as just an initial and surname.
Some names may be spelt slightly differently, but sound similar (such as Clark
and Clarke), and therefore pose a further risk of identifying the wrong member
of staff.
Key
The addition of a staff number as the primary key would enable us to be sure
that when we needed to refer to one or other of these members of staff, we had
identified the correct individual. In this way 11057 Clark can be distinguished
from 28076 Clark.
The following are examples of key data elements:
• The payroll number (primary key) of a member of staff enables us to find
out the name, job title and address for that individual.
• The account number (primary key) enables us to find out whether the
balance of that account is overdrawn.
• The item code (primary key) in a stationery catalogue enables us to order
a particular item in a particular size and colour (e.g. a red A4 folder).
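The idea that a primary key ‘unlocks’ the other attributes of an entity instance can be sketched relationally. Below is a minimal illustration using Python's built-in sqlite3 module; the table and column names (Staff, payroll_no, and so on) are invented for the example, not taken from the text.

```python
import sqlite3

# Hypothetical Staff relation keyed by payroll number. Given the
# primary key alone, every other attribute of that member of staff
# can be retrieved unambiguously.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Staff (
        payroll_no INTEGER PRIMARY KEY,  -- primary key: unique per employee
        name       TEXT NOT NULL,
        job_title  TEXT,
        address    TEXT
    )
""")
conn.executemany(
    "INSERT INTO Staff VALUES (?, ?, ?, ?)",
    [(11057, "Linda Clark", "Analyst", "12 High St"),
     (28076, "Lydia Clark", "Designer", "3 Mill Lane")],
)

# The two Clarks are distinguishable by payroll number alone.
row = conn.execute(
    "SELECT name, job_title FROM Staff WHERE payroll_no = ?", (11057,)
).fetchone()
print(row)  # ('Linda Clark', 'Analyst')
```

The same pattern applies to the account-number and item-code examples above: the key attribute is declared as the primary key, and every lookup goes through it.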
Sometimes we may need to use more than one attribute in order to arrive at a
key that will provide unique identification for all the other data elements. When
considering which attribute (or combination of attributes) might be used as a
primary key, these attributes are known as candidate keys.
Candidate keys
Where there is more than one set of attributes that could be chosen as the pri-
mary key for an entity, each of these sets of attributes is known as a candidate
key.
A company might choose either an employee’s staff number or an employee’s
National Insurance number as the primary key, as each will provide unique iden-
tification of an individual. (Note that in different countries, a slightly different
term might be used for a national code that is used to identify any one indi-
vidual, such as national ID number, etc.) The staff number and the National
Insurance number are candidate keys, until one is selected as the primary key.
At times we may refer to a collection of attributes that includes the primary key
(for example, staff number and staff name); this group of attributes is sometimes
known as a superkey.
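A candidate key that is not chosen as the primary key can still be declared unique, so the database continues to guarantee that it identifies exactly one row. The sketch below assumes an Employee relation with invented column names (staff_no, ni_number); it is an illustration, not the text's own schema.

```python
import sqlite3

# Both staff_no and ni_number are candidate keys: each uniquely
# identifies an employee. staff_no has been selected as the primary
# key; the remaining candidate key is declared UNIQUE.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Employee (
        staff_no  INTEGER PRIMARY KEY,   -- chosen primary key
        ni_number TEXT NOT NULL UNIQUE,  -- remaining candidate key
        name      TEXT NOT NULL
    )
""")
conn.execute("INSERT INTO Employee VALUES (1001, 'QQ123456C', 'J. Smith')")

# A second employee may not reuse either candidate key value.
try:
    conn.execute("INSERT INTO Employee VALUES (1002, 'QQ123456C', 'A. Jones')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
print(duplicate_allowed)  # False
```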
When we need to connect together different items of data (for example, cus-
tomers and items, in order to produce orders and invoices), we can do this by
including the primary key of one entity as a data item in another entity; for
example, we would include the primary key of Customer in the Order entity to
link customers to the Orders they have placed.
Foreign keys
When a copy of the primary key for one entity is included in the collection of
attributes of another entity, the copy of the primary key held in the second
entity is known as a foreign key.
A foreign key enables a link to be made between different entities.
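The Customer/Order example above can be sketched as follows, again using sqlite3; the table and column names are illustrative. Enabling foreign-key enforcement makes the database reject an order that refers to a customer that does not exist.

```python
import sqlite3

# The Customer primary key is posted into Order as a foreign key,
# linking each order to the customer who placed it.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.execute("CREATE TABLE Customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE "Order" (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES Customer(customer_id)
    )
""")
conn.execute("INSERT INTO Customer VALUES (1, 'Benchmarque Ltd')")
conn.execute('INSERT INTO "Order" VALUES (501, 1)')   # valid link

# An order for a non-existent customer is rejected.
try:
    conn.execute('INSERT INTO "Order" VALUES (502, 99)')
    orphan_allowed = True
except sqlite3.IntegrityError:
    orphan_allowed = False
print(orphan_allowed)  # False
```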
Entity-Relationship Modelling
Entity representation
You may also come across diagrams that employ ellipses to represent the at-
tributes belonging to each entity.
The relationships that exist between two entities can be categorised according
to the following:
• one-to-one
• one-to-many
• many-to-many
In some cases, for simplicity, the attributes are omitted in the entity diagram.
One-to-one relationships between two entities
In a concert hall, each ticket holder has a seat for a single performance (the seat
number will appear on the ticket). Only one person can sit in one seat at each
performance; the relationship between a member of the audience and a seat is
therefore one-to-one.
Each seat in the concert hall can be sold to one person only for a particular
performance; the relationship between the seat and the member of the audience
with a ticket for that seat is also one-to-one.
Relationships between entities and attributes, between attributes, and between
entities can be shown in a variety of diagrammatic formats. The common format
is to represent each relationship as a line. The style of the line shows the
type of relationship being represented. Here, in order to represent a one-to-one
relationship, a single straight line is used between the two entities.
The overall relationship between ticket holders and seats is one-to-one for each
performance. The entity-relationship diagram above shows the one-to-one link
between a ticket holder and a concert hall seat.
In an orchestra, each individual will play one type of musical instrument; for
example, a person who plays a violin will not play a trumpet. The relationship
is one-to-one from a member of the orchestra to a type of instrument.
One-to-many relationships between two entities
An orchestra will have more than one musician playing a particular type of
instrument; for example, it is likely that there will be several members of the
orchestra each playing a violin. The relationship is therefore one-to-many from
a type of musical instrument to a member of the orchestra.
Many-to-many relationships between two entities
Each ticket holder may attend many concerts as a member of the audience; the
relationship between an individual and the concerts is one-to-many.
Many ticket holders will attend each concert; the relationship between a concert
and members of the audience is also one-to-many.
As the relationship is one-to-many on both sides of the relationship, the rela-
tionship that exists between the two entities can be described as many-to-many.
The entity-relationship diagram above has a ‘crow’s foot’ connection at each end,
illustrating that there is a many-to-many relationship between ticket holders and
concert performances, as one ticket holder may attend many performances, and
each performance is likely to have many ticket holders present.
As it is difficult to implement a many-to-many relationship in a database system,
we may need to decompose a many-to-many relationship into two (or more)
one-to-many relationships. Here, we might say that there is a one-to-many
relationship between a ticket holder and a ticket (each ticket holder may have
several tickets, but each ticket will be held by only one person).
We could also identify a one-to-many relationship between a concert perfor-
mance and a ticket (each ticket for a particular seat will be for only one perfor-
mance, but there will be many performances each with a ticket for that seat).
This allows us to represent the many-to-many relationship between ticket holder
and concert performance: two one-to-many relationships involving a new entity
called Ticket For Seat. This new structure can then be implemented within a
Relational database system.
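The decomposition described above can be sketched as a relational schema, with the new Ticket For Seat entity becoming a linking relation holding a foreign key to each side. All names below (Ticket_Holder, Performance, seat_no, etc.) are illustrative.

```python
import sqlite3

# Many-to-many between ticket holders and performances, decomposed
# into two one-to-many links via a Ticket_For_Seat relation.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Ticket_Holder (holder_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Performance  (perf_id   INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE Ticket_For_Seat (
        holder_id INTEGER NOT NULL REFERENCES Ticket_Holder(holder_id),
        perf_id   INTEGER NOT NULL REFERENCES Performance(perf_id),
        seat_no   TEXT NOT NULL,
        PRIMARY KEY (perf_id, seat_no)   -- one seat sold once per performance
    );
    INSERT INTO Ticket_Holder VALUES (1, 'Ann'), (2, 'Ben');
    INSERT INTO Performance  VALUES (10, 'Evening Concert'), (11, 'Matinee');
    INSERT INTO Ticket_For_Seat VALUES (1, 10, 'A1'), (1, 11, 'A1'), (2, 10, 'A2');
""")

# One holder attends many performances; one performance has many holders.
ann_perfs = conn.execute(
    "SELECT COUNT(*) FROM Ticket_For_Seat WHERE holder_id = 1").fetchone()[0]
print(ann_perfs)  # 2
```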
Recursive relationships
The relationships we have seen so far have all been between two entities; this
does not have to be the case. It is possible for an entity to have a relationship
with itself; for example, an entity Staff could have a relationship with itself,
as one member of staff could supervise other staff. This is known as a recur-
sive or involute relationship, and would be represented in an entity-relationship
diagram as shown below.
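In relational terms, a recursive relationship is usually implemented with a foreign key that refers back to the same relation. The sketch below assumes a Staff relation with an invented supervisor_id column; staff with no supervisor hold a null there.

```python
import sqlite3

# Recursive (involute) relationship: Staff references itself, with
# supervisor_id holding the staff_no of the supervising member of
# staff. NULL marks staff who have no supervisor.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""
    CREATE TABLE Staff (
        staff_no      INTEGER PRIMARY KEY,
        name          TEXT NOT NULL,
        supervisor_id INTEGER REFERENCES Staff(staff_no)  -- self-reference
    )
""")
conn.executemany("INSERT INTO Staff VALUES (?, ?, ?)",
                 [(1, "Clark", None),    # supervises the others
                  (2, "Jones", 1),
                  (3, "Patel", 1)])

# Join the relation to itself to pair each person with their supervisor.
pairs = conn.execute("""
    SELECT s.name, b.name
    FROM Staff s JOIN Staff b ON s.supervisor_id = b.staff_no
    ORDER BY s.staff_no
""").fetchall()
print(pairs)  # [('Jones', 'Clark'), ('Patel', 'Clark')]
```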
Exercises
Exercise 1: Identifying entities and attributes
Benchmarque International, a furniture company, keeps details of items it sup-
plies to homes and offices (tables, chairs, bookshelves, etc.). What do you think
would be the entities and attributes the furniture company would need to rep-
resent these items?
Exercise 2: Identification of primary keys
What do you think would make a suitable primary key for the entity (or entities)
representing the tables, chairs, bookshelves and other items of furniture for
Benchmarque International?
In other words, what are the candidate keys?
Exercise 3: Identifying relationships
At a conference, each delegate is given a bound copy of the proceedings, contain-
ing a copy of all the papers being presented at the conference and biographical
details of the speakers.
What is the relationship between a delegate and a copy of the proceedings?
Draw the entity-relationship diagram.
Exercise 4: Identifying relationships II
Many papers may be presented at a conference.
Each paper will be presented once only by one individual (even if there are
multiple authors).
Many delegates may attend the presentation of a paper.
Papers may be grouped into sessions (two sessions in the morning and three in
the afternoon).
What do you think is the relationship between:
• a speaker and a paper
• a paper and a session
Exercise 5 — Identifying relationships III
A conference session will be attended by a number of delegates. Each delegate
may choose a number of sessions. What is the relationship between conference
delegates and sessions? Draw the entity-relationship diagram.
Membership class (participation condition)
As there are two kinds of participation conditions (mandatory and optional),
and most entities are involved in binary relationships, it follows that there are
four main types of membership relationships, as follows:
1. Mandatory for both entities
2. Mandatory for one entity, optional for the other
3. Optional for one entity, mandatory for the other
4. Optional for both entities
It might be tempting to think that options 2 and 3 are the same, but it is
important to recognise the difference, particularly when thinking about whether
the relationship is one-to-one, one-to-many or many-to-many. A useful analogy
is to think of a bank, with customers who have savings accounts and loans. It
may be the bank’s policy that any customer must have a savings account before
they are eligible to receive a loan, but not all customers who have savings
accounts will require a loan.
We can examine how these different types of membership classes can be used
to reflect the policies of allocating staff within departments. We would expect
any member of staff in an organisation to work in a given department, but what
happens if a new department is created, or a new member of staff joins? If we
look at each combination in turn, we can see what the possibilities are:
1. Mandatory for both entities: A member of staff must be assigned to
a given department, and any department must have staff. There can be
no unassigned staff, and it is not possible to have an ‘empty’ department.
2. Mandatory for one entity, optional for the other: Any member of
staff must be attached to a department, but it is possible for a department
to have no staff allocated.
3. Optional for one entity, mandatory for the other: A member of
staff does not have to be placed in a department, but all departments
must have at least one member of staff.
4. Optional for both entities: A member of staff might be assigned to
work in a department, but this is not compulsory. A department might,
or might not, have staff allocated to work within it.
We can elaborate the standard entity-relationship notation with a solid circle to
indicate a mandatory entity, and a hollow circle for an optional entity (think
of the hollow circle like ‘o’ for optional). (You may find alternative notations
in other texts - for example, a solid line to represent a mandatory entity, and a
dotted line to indicate an optional entity. Another method places solid circles
inside entity boxes for mandatory participation, or outside entity boxes for op-
tional membership.) The use of a graphical technique enables us to represent
the membership class or participation condition of an entity and a relationship
in an entity-relationship diagram.
We will now explore these possibilities using a performer, agents and bookings
scenario as an example, but experimenting with different rules to see what effect
they have on the design of the database. Supposing to start with, we have the
following situation.
There are a number of performers who are booked by agents to appear at dif-
ferent venues. Performers are paid a fee for each booking, and agents earn
commission on the fee paid to each performer. We will now consider relation-
ships of different kinds between these entities.
The solid circle at each end of the relationship shows that the relationship is
mandatory in both directions; each performer must have an agent, and each
agent must deal with one performer.
The solid circle at the performer end of the relationship illustrates that a per-
former must be associated with an agent. The hollow circle at the agent end of
the relationship shows that an agent could be associated with a performer, but
that this is not compulsory. Each performer must have an agent, but not all
agents represent performers.
One end optional, other end mandatory:
It might be possible for performers to make bookings themselves, without using
an agent. In this case, one performer might have an agent, and that agent will
make bookings for that performer. On the other hand, a different performer
might elect to make their own bookings, and will not be represented by an agent.
All agents must represent a performer, but not all performers will be represented
by agents. The relationship is optional for the performer, but mandatory for
the agent, as shown in the diagram below.
The solid circle at the agent end of the relationship shows each agent must be
associated with a performer. The hollow circle at the performer end of the
relationship indicates that a performer could be represented by an agent, but
that this is not compulsory. Each agent must deal with only one performer, but
each performer does not have to have an agent.
It might be the case that a performer has only one agent, and that all bookings
for any one performer must be made by one agent, although any agent may
make bookings for more than one performer.
The membership class is mandatory for both entities, as shown by the solid
circle. In this case, it is not possible for a booking to be made for an event
that does not involve a performer (for example, a booking could not be for an
exhibition).
The solid circle shows the compulsory nature of the relationship for a performer;
all performers must have bookings. The hollow circle shows that it is optional
for a booking to involve a performer. This means that a performer must have a
booking, but that a booking need not have a performer.
The membership class is mandatory for a booking, but optional for a performer.
This means that it would not be possible for a booking to be for an exhibition, as
all bookings must involve a performer. On the other hand, it is not compulsory
for a performer to have a booking.
A performer might have one or more bookings; a booking might be associated
with a performer.
In this example, it is still necessary for performers to be represented by a number
of agents, but the agents now have more flexibility as they do not have to make
bookings for performers.
There is a many-to-many relationship between the two entities; one must par-
ticipate, but it is optional for the other entity.
Weak and strong entities
An entity set that does not have a primary key of its own is referred to as a weak
entity set. A weak entity set depends for its existence on a strong entity set,
called the identifying entity set.
The relationship must be many-to-one from weak to identifying entity. Par-
ticipation of the weak entity set in the relationship must be mandatory. The
discriminator (or partial key) of a weak entity set distinguishes weak entities
that depend on the same specific strong entity. The primary key of a weak
entity is therefore the primary key of the identifying entity set combined with
the partial key of the weak entity set.
Example: Many payments are made on a loan
• Payments don’t exist without a loan.
• Multiple loans will each have a first, second payment and so on. So, each
payment is only unique in the context of the loan which it is paying off.
A weak entity is commonly represented by a double rectangle (two nested boxes).
The payment is a weak entity; its existence is dependent on the loan entity.
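The loan/payment example can be sketched as relations, with the payment's primary key formed from the loan's key plus the payment number (the partial key). Table and column names below are illustrative.

```python
import sqlite3

# Weak entity Payment: its primary key is the identifying entity's
# key (loan_id) plus its own partial key (payment_no), so a payment
# is only unique within the context of its loan.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Loan (loan_id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE Payment (
        loan_id    INTEGER NOT NULL REFERENCES Loan(loan_id),
        payment_no INTEGER NOT NULL,          -- partial key (discriminator)
        paid       REAL NOT NULL,
        PRIMARY KEY (loan_id, payment_no)     -- identifying key + partial key
    );
    INSERT INTO Loan VALUES (1, 5000.0), (2, 800.0);
    -- Each loan has its own 'first payment': payment_no repeats across loans.
    INSERT INTO Payment VALUES (1, 1, 100.0), (2, 1, 50.0), (1, 2, 100.0);
""")
first_payments = conn.execute(
    "SELECT COUNT(*) FROM Payment WHERE payment_no = 1").fetchone()[0]
print(first_payments)  # 2
```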
Connection traps
In this section we examine problems that may arise when creating an ER model.
These problems are referred to as connection traps, and normally occur due to
a misinterpretation of the meaning of certain relationships. We examine two
main types of connection traps, called fan traps and chasm traps, and illustrate
how to identify and resolve such problems in ER models.
Fan traps
These occur when a model represents a relationship between entity types, but
the pathway between certain entity occurrences is ambiguous. Look at the
model below.
The model above looks fine at first glance, but it has a pitfall. It says that a
faculty has many departments and many staff. Although the model seems to
capture all the necessary information, it is difficult to tell which department a
member of staff is affiliated to. To find the department a member of staff belongs
to, we start from the Staff entity; through the relationship between Staff and
Faculty we can easily identify the faculty the staff member belongs to, but from
the faculty it is difficult to know the exact department, because one faculty is
associated with many departments.
The model below removes the fan trap.
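A resolved version of this structure can be sketched relationally: Staff references Department, and Department references Faculty, so a staff member's department (and hence faculty) is reached through one unambiguous path. The schema and data below are invented for illustration.

```python
import sqlite3

# Fan trap resolved: Faculty no longer branches directly to both
# Department and Staff. Staff -> Department -> Faculty gives a single
# unambiguous path from a staff member to their faculty.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Faculty    (fac_id  INTEGER PRIMARY KEY, fac_name TEXT);
    CREATE TABLE Department (dept_id INTEGER PRIMARY KEY, dept_name TEXT,
                             fac_id  INTEGER NOT NULL REFERENCES Faculty(fac_id));
    CREATE TABLE Staff      (staff_no INTEGER PRIMARY KEY, name TEXT,
                             dept_id  INTEGER NOT NULL REFERENCES Department(dept_id));
    INSERT INTO Faculty VALUES (1, 'Science');
    INSERT INTO Department VALUES (10, 'Computing', 1), (11, 'Physics', 1);
    INSERT INTO Staff VALUES (100, 'Clark', 10);
""")

# Clark's department is now unambiguous.
dept = conn.execute("""
    SELECT d.dept_name FROM Staff s
    JOIN Department d ON s.dept_id = d.dept_id
    WHERE s.staff_no = 100
""").fetchone()[0]
print(dept)  # Computing
```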
Chasm traps
These occur when a model suggests the existence of a relationship between entity
types, but the pathway does not exist between certain entity occurrences.
The model represents the facts that a faculty has many departments and that
each department may have zero or many staff. Note that not all departments
have staff, and not all staff belong to a department; an example in a university
is the dean's secretary, who does not belong to any department.
It is therefore difficult to answer the question, “Which faculty does the dean's
secretary belong to?”, as the only pathway from staff to faculty runs through
a department.
We remove the ‘chasm trap’ by adding an extra relationship from staff to faculty.
Converting entity relationships into relations
When we have identified the main entities and the relationships that exist be-
tween them, we are in a position to translate the entity-relationship model we
have created from a diagram into tables of data that will form the relations
for our database. The nature of the relationships between entities will make a
difference to the nature of the relations we construct; the cardinality, degree
and membership class will all affect the structure of the database.
If we design a database by using an entity-relationship model, we need to be
able to convert our design from a diagrammatic format into a series of relations
that will hold the values of the actual data items.
It would be possible to create a separate relation for every entity and every
relationship. This approach would generate a relational database that repre-
sented the entities and the relationships between them as identified in our data
model, but it would suffer from a number of disadvantages. One is that the
number of relations created could make the database unnecessarily large. A
database created in this way would also be vulnerable to a number of insertion,
update and deletion anomalies, which will be examined in the chapter on Nor-
malisation. To avoid these problems, we need a method that creates only those
relations strictly necessary to represent our data model as a database. The way
we do this is guided by the nature of the relationships between the entities, in
terms of their cardinality and membership class (participation condition).
The number of relations needed to represent a relationship depends on the car-
dinality (one-to-one, one-to-many or many-to-many) and the membership class
(mandatory or optional) of the entities participating in the relationship.
Converting one-to-one relationships into relations
In the case of one-to-one relationships, the creation of one or two relations is
sufficient, depending on whether participation is mandatory or optional.
Mandatory for both entities
In the relation Performer-details above, we can see that all performer informa-
tion is stored and can be accessed by the performer-id attribute, and all agent
information can be extracted by means of the agent-id attribute.
As the relationship is one-to-one and mandatory in both directions, we do not
need to store the performers and agents in separate relations, although we could
choose to do so. (If we stored performers and agents in separate relations, we
would then need to use the identifying attributes of performer-id and agent-id
as foreign keys. This means that we would be able to identify the relevant agent
in the Performer relation, and identify the appropriate performer in the Agent
relation.)
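A minimal sketch of the single-relation conversion, with invented attribute names: because the relationship is one-to-one and mandatory in both directions, one row can hold both the performer's and the agent's details, and either identifier reaches the whole row.

```python
import sqlite3

# One-to-one, mandatory for both entities: performer and agent details
# held together in a single Performer_details relation.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Performer_details (
        performer_id   INTEGER PRIMARY KEY,
        performer_name TEXT NOT NULL,
        agent_id       INTEGER NOT NULL UNIQUE,  -- one agent per performer
        agent_name     TEXT NOT NULL
    )
""")
conn.execute(
    "INSERT INTO Performer_details VALUES (1, 'Miss Dynamite', 7, 'A. Broker')")

# The same row answers queries from either side of the relationship.
by_agent = conn.execute(
    "SELECT performer_name FROM Performer_details WHERE agent_id = 7"
).fetchone()[0]
print(by_agent)  # Miss Dynamite
```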
Mandatory for one entity, optional for the other entity
In this case, two relations will be needed, one for each entity. The relationship
could be mandatory for the first entity and optional for the second, or the other
way around. There are therefore two possibilities for performers and agents.
In this first example, a performer must be represented by an agent, but an agent
does not have to represent a performer. The relationship is therefore mandatory
for a performer, but optional for an agent.
This would convert into two relations, one for each entity. The agent identifier is
stored in the Performer relation in order to show the connection between agents
and performers where appropriate. This is known as posting an identifier (or
posting an attribute). It is important that the value of a posted identifier is not
null.
Relation: Performer
Note that the agent identifier, agent-id, is held in the Performer relation. The
attribute agent-id is a foreign key in the Performer relation. This means that
we can identify which agent represents a particular performer.
We would not want to store the performer-id in the Agent relation for this
example, as there are agents who do not represent performers, and there would
therefore be a null value for the performer-id attribute in the Agent relation.
We can see that there are agents in the Agent relation who do not represent
performers, but all performers are represented by only one agent.
Relation: Agent
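The rule that a posted identifier must never be null can be enforced directly with a NOT NULL foreign key. The sketch below (invented names and data) shows the mandatory-for-performer, optional-for-agent case: every performer row must name an agent, while an agent with no performers simply never appears as a foreign-key value.

```python
import sqlite3

# Mandatory for Performer, optional for Agent: the posted agent-id in
# Performer is declared NOT NULL, so a performer cannot exist without
# an agent.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Agent (agent_id INTEGER PRIMARY KEY, agent_name TEXT);
    CREATE TABLE Performer (
        performer_id   INTEGER PRIMARY KEY,
        performer_name TEXT NOT NULL,
        agent_id INTEGER NOT NULL REFERENCES Agent(agent_id)  -- never null
    );
    -- Agent 8 represents no performer: allowed, as the role is optional.
    INSERT INTO Agent VALUES (7, 'A. Broker'), (8, 'B. Booker');
    INSERT INTO Performer VALUES (1, 'Miss Dynamite', 7);
""")

# A performer without an agent violates the mandatory participation.
try:
    conn.execute("INSERT INTO Performer VALUES (2, 'The Amazing Zed', NULL)")
    agentless_allowed = True
except sqlite3.IntegrityError:
    agentless_allowed = False
print(agentless_allowed)  # False
```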
In the second example, an agent must represent a performer, but a performer
does not need to have an agent. Here, the relationship is optional for a performer,
but mandatory for an agent.
Again, this would translate into two relations, one for each entity. On this
occasion, however, the link between performers and agents will be represented
in the Agent relation rather than the Performer relation. This is because every
agent will be associated with a performer, but not all performers will be linked
to agents. The performer-id is a foreign key in the Agent relation. We cannot
have the agent identifiers in the Performer relation as in some instances there
will be no agent for a performer, and a null value for an agent identifier is not
allowed, as it would contravene the rules on entity integrity.
Relation: Performer
Relation: Agent
Optional for both entities
In this scenario, a performer might or might not have an agent. Similarly,
an agent might or might not represent a performer. However, if a performer
does have an agent, that agent will not represent any other performers. The
relationship between the two entities is one-to-one, but optional on both sides.
In order to convert this relationship into a relational format, three relations will
be needed, one for each entity and one for the relationship.
This means that it is possible to have a performer without an agent, and it
is also permissible for an agent to have no performers. All performer details
will be stored in the Performers relation, and all agent data will be held in the
Agent relation. Where there is a performer with an agent, this will be shown in
the relation Works-with, which will represent the relationship between the two
entities.
Relation: Performer
The relation Performers holds details of all the performers relevant to the
database.
Relation: Agents
All agents within the database are stored in the relation Agents.
Relation: Works-with
Note that the relation Works-with only has entries for those agents and per-
formers who are linked together.
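The three-relation conversion for the optional-on-both-sides case can be sketched as below (names invented): the Works-with relation records only the pairings that exist, so neither base relation ever needs a null foreign key, and a UNIQUE constraint on each column keeps the relationship one-to-one.

```python
import sqlite3

# Optional on both sides: three relations, with Works_with holding
# only those performer/agent pairings that actually exist.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Performer (performer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Agent     (agent_id     INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Works_with (
        performer_id INTEGER NOT NULL UNIQUE REFERENCES Performer(performer_id),
        agent_id     INTEGER NOT NULL UNIQUE REFERENCES Agent(agent_id)
    );  -- UNIQUE on both columns keeps the relationship one-to-one
    INSERT INTO Performer VALUES (1, 'Miss Dynamite'), (2, 'The Amazing Zed');
    INSERT INTO Agent VALUES (7, 'A. Broker'), (8, 'B. Booker');
    INSERT INTO Works_with VALUES (1, 7);  -- Zed and B. Booker are unpaired
""")
paired = conn.execute("SELECT COUNT(*) FROM Works_with").fetchone()[0]
print(paired)  # 1
```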
Converting one-to-many relationships into relations
Mandatory for both entities
If we convert this part of our data model into tables of data, we will have two
relations (one for each entity). In order to maintain the relationship that exists
between the two entities, we hold a copy of the primary key of the entity at
the “one” end of the relationship as one of the attributes associated with the
entity at the “many” end of the relationship. In this example, the attribute
agent-id is a foreign key in the relation Performers.
Relation: Performers
Relation: Agents
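Posting the “one” end's primary key at the “many” end can be sketched as follows (names and data invented): one agent row is referenced by several performer rows.

```python
import sqlite3

# One-to-many: the primary key of the 'one' end (Agents) is posted
# into the relation at the 'many' end (Performers), so one agent can
# be linked to several performers.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Agents (agent_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Performers (
        performer_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        agent_id INTEGER NOT NULL REFERENCES Agents(agent_id)  -- foreign key
    );
    INSERT INTO Agents VALUES (7, 'A. Broker');
    INSERT INTO Performers VALUES (1, 'Miss Dynamite', 7),
                                  (2, 'The Amazing Zed', 7);
""")
clients = conn.execute(
    "SELECT COUNT(*) FROM Performers WHERE agent_id = 7").fetchone()[0]
print(clients)  # 2
```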
Mandatory for one entity, optional for another entity: many end
mandatory
In this example, all performers must be represented by agents, and each per-
former has only one agent. The agents themselves need not be responsible for
making bookings for performers, and can be involved in other activities.
The mandatory nature of the relationship for the performer is shown by the
solid circle; the hollow circle indicates an optional relationship for an agent.
This means that there must be a relation to represent performers, and another
relation to represent agents. The links between performers and agents are shown
by having the agent identifier stored against the appropriate performer in the
Performer relation. The attribute agent-id is therefore a foreign key in the
Performer relation. All performers must have an agent associated with them,
but not all agents will be involved in a booking for a performer.
Relation: Performers
Relation: Agents
Mandatory for one entity, optional for another entity: many end
optional
Here, agents may make bookings for performers, and performers may also make
bookings for themselves. It is only possible for agents to make bookings for
functions that involve performers. An agent may be responsible for making
bookings for more than one performer. If a performer is represented by an
agent, each performer may have only one agent.
The mandatory nature of the relationship for the agent is shown by the solid
circle; the hollow circle indicates an optional relationship for a performer. This
means that there must be a relation to represent performers, another relation
to represent agents, and a third relation to represent those occasions when
performers have booked through agents. The links between performers and
agents are shown by having the agent identifier stored against the appropriate
performer in the third relation.
Relation: Performers
Relation: Agents
Relation: Agent-Performer
Optional for both entities
This relationship can be converted into three relations: one to represent the
performers, another for the agents, and a third to store details of the relationship
between performers and agents (where such a relationship exists).
Relation: Performers
Relation: Agents
Relation: Agent-Performer
We can see from these relations that a performer may be represented by an
agent, and an agent may represent more than one performer. Some performers
do not have agents, and some agents do not represent performers.
Relation: Agents
Relation: Agent-Performers
The Agent-Performers relation shows us that all performers are represented by
agents, and that all agents represent performers. Some performers are repre-
sented by more than one agent, and some agents represent more than one performer.
We now have three relations representing the many-to-many relationship manda-
tory for both entities.
The entity relationship diagram above shows that it is mandatory for performers,
but optional for agents to participate. This is translated into three relations be-
low. Note that in the relation Agent-Performers, all performers are represented
by an agent (or more than one agent). There are some agents in the Agent
relation who do not appear in Agent-Performers because they do not represent
performers.
Relation: Performers
Relation: Agents
Relation: Agent-Performers
The second possibility for this kind of relationship is that participation is op-
tional for the performer entity but mandatory for the agent entity. In this case,
a performer might have one or more agents, but an agent must represent at
least one performer. Here, a performer could make a booking personally, or
could have bookings made by a number of different agents. The agents can only
make bookings for performers, and for no other kind of event.
Relation: Agents
Relation: Agent-Performers
The relation Agent-Performers shows that all agents represent one or more per-
formers. Some performers are represented by more than one agent, whereas
other performers are not represented by agents at all.
In order to represent this relationship between two entities, we would need three
relations, one for each entity and one for the relationship itself. The reason
we need three relations rather than just two (one for each entity) is that the
relationship is optional. This means that if we were to store the identifier of one
entity in the relation of the other, there would be times when we would have a
null value for the identifier as no relationship exists for a particular instance of
the entity. We cannot have a null value for an identifier, and therefore we show
the relationships that do exist explicitly in a third relation.
Relation: Performers
Relation: Agents
Relation: Agent-Performers
Summary of conversion rules
The following guidelines, drawn from the cases examined above, summarise how
components of an entity-relationship diagram are converted into relations:
• One-to-one, mandatory for both entities: one relation holding the at-
tributes of both entities is sufficient (separate relations may be used if
preferred, with each holding the other's identifier as a foreign key).
• One-to-one, mandatory for one entity, optional for the other: two rela-
tions, with the identifier of the optional entity posted as a foreign key in
the relation for the entity whose participation is mandatory.
• One-to-one, optional for both entities: three relations — one for each
entity and one for the relationship.
• One-to-many, many end mandatory: two relations, with the identifier of
the “one” end posted as a foreign key in the relation at the “many” end.
• One-to-many, many end optional: three relations — one for each entity
and one for the relationship.
• Many-to-many (whatever the participation conditions): three relations —
one for each entity and one for the relationship.
We need to be certain that if we store an identifier for one entity in a relation
representing another entity, that identifier never has a null value. If we have a
null value for an identifier, we will never be able to find the other details that
should be associated with it.
Review questions
Study the scenario below and identify the entities, attributes and relationships
you would need in order to build a data model. Think about areas where you
don't have enough information, and how you would deal with this kind of
problem. You might also find that there is information that you don't need for
building the data model.
“Authors are responsible for writing plays that are performed in theatres. Every
time a play is performed, the author will be paid a royalty (a sum of money for
each performance).
Plays are performed in a number of theatres; each theatre has a maximum
auditorium size, and many people attend each performance of a play. Many of the
theatres have afternoon and evening performances.
Actors are booked to perform roles in the plays; agents make these bookings
and take a percentage of the fee paid to the actor as commission. The roles in
the plays can be classified as leading or minor roles, speaking or non-speaking,
and male or female.”
• Explain the difference between entities and attributes. Give examples of
each.
• Distinguish between the terms ‘entity type’ and ‘entity instance’, giving
examples.
• Distinguish between the terms ‘primary key’ and ‘candidate key’, giving
examples.
• Explain what is meant by one-to-one, one-to-many and many-to-many
relationships between entities, giving an example of each.
• How are many-to-many relationships implemented in Relational
databases?
Chapter 7. Enhanced Entity-Relationship Modelling
Table of contents
• Objectives
• Introduction
• Context
• Recap on previous concepts
– Entities
– Relationship types
– Relationship participation
• Specialization/generalization
– Representation of specialization/generalization in ER diagrams
– Constraints on specialization/generalization
– Mapping specialization/generalization to relational tables
• Aggregation
– Representation of aggregation in ER diagrams
• Composition
– Representation of composition in ER diagrams
• Additional content - XML
– What is XML?
∗ Element
∗ Attribute
∗ Example representing relational table records in XML
– Document type definition
– Namespaces
– XQuery
Objectives
Introduction
In parallel with this chapter, you should read Chapter 12 of Thomas Connolly and Carolyn Begg, “Database Systems A Practical Approach to Design, Implementation, and Management”, (5th edn.).
This chapter builds on the previous chapter, which addressed the basic concepts
of Entity-Relationship (ER) modelling. That chapter discussed the concepts of an
entity, participation, recursive relationships, weak entities and strong
entities. It also illustrated how these concepts can be represented in ER
diagrams.
Improved computer speed and memory have, in recent years, triggered the
development of sophisticated software applications such as Geographical
Information Systems (GIS). The basic features of ER modelling are not sufficient
to represent all the concepts in such applications. To address these needs, many
different semantic data models have been proposed, and some of the most
important semantic concepts have been successfully incorporated into the
original ER model.
This chapter discusses and illustrates advanced ER modelling concepts, namely
specialization/generalization, aggregation and composition.
Context
This chapter continues to address top-down database design concepts. Like the
previous chapter, it links closely with the other chapters on database design,
such as those on normalisation and other design topics. The chapter also has
considerable relevance for the material in the module on performance tuning,
such as the chapter on indexing, as the decisions made during database design
have a major impact on the performance of the application.
Entities
Relationship types
These express the number of entities with which another entity can be associated
via a relationship. The relationships that exist between two entities can be
categorised as one of the following:
• one-to-one
• one-to-many
• many-to-many
Relationship participation
4. Optional for both entities
Specialization/generalization
We have discussed different types of relationships that can occur between entities.
Some entities have relationships that form a hierarchy. For example, a shipping
company can have different types of ships for its business. The relationship that
exists between the concept of the ship and the specific types of ships forms a
hierarchy. The ship is called a superclass. The specific types of ships are called
subclasses.
Superclass: An entity type that represents a general concept at a high level.
Subclass: An entity type that represents a specific concept at lower levels.
A subclass is said to inherit from a superclass. A subclass can inherit from many
superclasses in the hierarchy. When a subclass inherits from one or more super-
classes, it inherits all their attributes. In addition to the inherited attributes,
a subclass can also define its own specific attributes. A subclass also inherits
participation in the relationship sets in which its superclass (higher-level entity)
participates.
The process of making a superclass from a group of subclasses is called
generalization. The process of making subclasses from a general concept is
called specialization.
Specialization: A means of identifying sub-groups within an entity set which
have attributes that are not shared by all the entities (top-down).
Generalization: Multiple entity sets are synthesized into a higher-level entity
set, based on common features (bottom-up).
As an example, let’s consider the following scenario:
Africa holds many historical artefacts in different locations. Each artefact is kept
in a specific location. A location can be a point, province, country or sub-region
of Africa.
The scenario has a specialization relationship between the location and different
specific types of locations (i.e. point, province, country and sub-region). This
specialization relationship is represented in the ER diagram below.
Constraints on specialization/generalization
• Completeness constraints
Total: Every entity in the superclass (higher-level entity set) must belong
to one of the subclasses (lower-level entity sets); e.g. a student must be
either a postgraduate or an undergraduate. To represent completeness in the
specialization/generalization relationship, the keyword ‘Mandatory’ is used.
We can show both disjoint and completeness constraints in the ER diagram.
Following our examples, we can combine the two kinds of constraint. Some
members of a university are both students and staff, but not all members of
the university are students or staff.
A student in the university must be either an undergraduate or postgraduate,
but not both.
Method 1
All the entities in the relationship are mapped to individual tables.
Student (Regno, name)
PosGrad (Regno, supervisor)
UnderGrad (Regno, points)
Method 2
Only subclasses are mapped to tables. The attributes in the superclass are
duplicated in all subclasses.
PosGrad (Regno, name, supervisor)
UnderGrad (Regno, name, points)
This method is preferred when the inheritance is disjoint and complete,
e.g. every student is either a PosGrad or an UnderGrad, and nobody is both.
Method 3
Only the superclass is mapped to a table. The attributes of the subclasses are
moved up into the superclass table.
Student (Regno, name, supervisor, points)
This method will introduce null values. When we insert an undergraduate record
in the table, the supervisor column value will be null. In the same way, when
we insert a postgraduate record in the table, the points value will be null.
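To make the trade-offs concrete, here is a small sketch of Method 1 in SQLite; the sample students and the supervisor name are invented for illustration:

```python
import sqlite3

# Method 1: one table for the superclass and one per subclass,
# linked through the shared key Regno.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE Student (Regno INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE PosGrad (Regno INTEGER PRIMARY KEY, supervisor TEXT)")
cur.execute("CREATE TABLE UnderGrad (Regno INTEGER PRIMARY KEY, points INTEGER)")
cur.execute("INSERT INTO Student VALUES (1, 'Ada'), (2, 'Ben')")
cur.execute("INSERT INTO PosGrad VALUES (1, 'Dr Smith')")
cur.execute("INSERT INTO UnderGrad VALUES (2, 120)")
# Listing postgraduates requires a join back to the superclass table;
# the benefit is that no column ever needs to hold a null.
rows = cur.execute("""SELECT s.name, p.supervisor
    FROM Student s JOIN PosGrad p ON s.Regno = p.Regno""").fetchall()
print(rows)  # [('Ada', 'Dr Smith')]
```

Method 2 would drop the Student table and repeat name in both subclass tables; Method 3 would keep only Student, with supervisor and points as nullable columns, as discussed above.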
Review question 1
Discuss the specialization/generalization relationship in ER modelling.
Review question 2
Explain the three constraints that can be applied to the
specialization/generalization relationship.
Aggregation
The ‘whole’ part must be put at the end of the diamond. For example, the
Car-Engine relationship would be represented as shown below:
Composition
Representation of composition in ER diagrams
Review question 3
Using an example, explain the concepts of aggregation and composition.
Exercise 1
Draw the ER diagram for a small database for a bookstore. The database will
store information about books for sale. Each book has an ISBN, title, price and
short description. Each book is published by a publisher in a certain publishing
year. For each publisher, the database maintains the name, address and phone
number.
Each book is written by one or more authors. For each author, the database
maintains his/her ID, name and a short introduction. Each book is stored
in exactly one warehouse with a particular quantity. For each warehouse, the
database maintains the warehouse name, the location and the phone number.
Each book has one or more sellers, which may be either companies (corporate
vendors) or individuals (individual vendors).
For each company, the database maintains a name of the company, its address,
its phone numbers (there could be more than one phone number, each with a
number and a description) and its contact person. For each individual vendor,
the database keeps a name, a phone number and an email address. A contact
person whose company sells a book cannot be selling the same book as an
individual vendor at the same time (he/she may sell other books as an individual
seller).
What is XML?
XML (eXtensible Markup Language) has become a standard for structured data
interchange among businesses. It was formally ratified by the World Wide Web
Consortium (W3C) in 1998. XML uses markup to structure plain text; markup refers
to auxiliary information (tags) in the text that gives it structure and meaning.
We have demonstrated how to use relational tables to represent entities and their
attributes. XML also supports the representation of entities and attributes.
In this section, we will introduce XML; students are encouraged to consult
detailed books for further information. One useful website for learning XML is
http://www.w3schools.com/xml/default.asp.
Element
An element is a building block of an XML document.
• All elements are delimited by < and >.
• Element names are case-sensitive and cannot contain spaces.
The representation of an element is shown below:
<Element> …. </Element>
An XML document can contain many elements, but one must be the root element.
The root element is the parent of all other elements.
Attribute
Elements can have attributes. Attributes are specified by name=value pairs
inside the starting tag of an element:
<Element attribute="value"> .. </Element>
All values of the attributes are enclosed in double quotes.
An element can have several attributes, but each attribute name can only occur
once.
<Element attribute1="value1" attribute2="value2">
Explanation
• <?xml version="1.0" encoding="UTF-8"?>: is the XML prolog.
The prolog is used to specify the version of XML and the encoding used.
It is optional, but if it appears in the document, it must be the first line
in the document.
• Customers element: Customers is the root element.
• Customer element: A Customer element represents a tuple in the Customers
table. The table has three attributes: CUSTOMER_ID, NAME and LOCATION. In our
XML, CUSTOMER_ID is represented as an attribute of the Customer element, while
NAME and LOCATION are represented as child elements of the Customer element.
Notice that we have repeated the Customer element three times to capture the
three records in the Customers table.
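As a sketch, the structure described above can be built with Python's standard-library ElementTree; the customer values used here are assumed sample data:

```python
import xml.etree.ElementTree as ET

# Build one Customer record: CUSTOMER_ID as an attribute,
# NAME and LOCATION as child elements of Customer.
customers = ET.Element("Customers")  # the root element
customer = ET.SubElement(customers, "Customer", {"CUSTOMER_ID": "1"})
ET.SubElement(customer, "NAME").text = "Alice"
ET.SubElement(customer, "LOCATION").text = "London"
xml_text = ET.tostring(customers, encoding="unicode")
print(xml_text)
# <Customers><Customer CUSTOMER_ID="1"><NAME>Alice</NAME><LOCATION>London</LOCATION></Customer></Customers>
```

Further Customer sub-elements would be added in the same way, one per row of the table.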
The XML technology specifies the syntax for writing well-formed documents
but does not impose the structure of the document. XML document writers
are free to structure an XML document in any way they want. This can be
problematic when verifying a document. How many elements can a document
have? What elements should a document have? These questions are difficult to
answer unless we also specify the structure of the document. Document type
definition (DTD) is used to define the structure of an XML document.
DTD specifies the following:
• What elements can occur.
• What attributes an element can/must have.
• What sub-elements can/must occur inside each element, and how many
times.
DTD element syntax:
<!ELEMENT element (subelements-specification) >
DTD attribute syntax:
<!ATTLIST element (attributes) >
The DTD for the XML we defined above can be defined as shown below:
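The original figure is not reproduced here; based on the explanation that follows, the DTD would take roughly this form (element and attribute names as described in the explanation):

```xml
<!DOCTYPE Customers [
  <!ELEMENT Customers (Customer+)>
  <!ELEMENT Customer (Name, Location)>
  <!ELEMENT Name (#PCDATA)>
  <!ELEMENT Location (#PCDATA)>
  <!ATTLIST Customer customerID CDATA #REQUIRED>
]>
```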
Explanation
• !DOCTYPE: Defines that the Customers element is the root element of the
document.
• <!ELEMENT>: Defines an XML element. The first element to be defined is the
Customers element. A Customers element has one child element, Customer,
indicated in brackets. The + symbol means that the Customer element can appear
one or more times under the Customers element. The Customer element has two
sub-elements, Name and Location. The Name and Location elements contain
character data.
• <!ATTLIST>: Defines an attribute. The Customer element has one attribute,
customerID, of type character data.
Namespaces
XML data is often exchanged between organisations. The same element name may
have different meanings in different organisations, causing confusion in
exchanged documents. One way to avoid this confusion is to prefix each element
name with a unique string:
unique-name:element-name
Adding a unique name to all element names can be cumbersome for long docu-
ments. To avoid using long unique names all over a document, we can use XML
namespaces.
XQuery
XQuery is a language for finding and extracting elements and attributes from
XML documents. Just as SQL is the query language for relational databases,
XQuery is the query language for XML documents. For example, to display all the
names of the customers in the XML above, our XQuery would look as follows:
for $x in /Customers/Customer
return $x/Name
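For comparison, the same query can be sketched with Python's standard-library ElementTree, which supports a limited XPath subset (the document content below is assumed sample data):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<Customers>"
    "<Customer customerID='1'><Name>Alice</Name><Location>London</Location></Customer>"
    "<Customer customerID='2'><Name>Bob</Name><Location>Paris</Location></Customer>"
    "</Customers>"
)
# Equivalent of: for $x in /Customers/Customer return $x/Name
names = [n.text for n in doc.findall("Customer/Name")]
print(names)  # ['Alice', 'Bob']
```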
Exercise 2
In chapter 3, Introduction to SQL, we introduced the EMP table. Represent
the records in the table in XML.
Chapter 8. Data Normalisation
Table of contents
• Objectives
• Introduction
• Context
• Determinacy diagrams
– Determinants and determinacy diagrams
– Direct dependencies
– Transitive (indirect) dependencies
– Composite determinants and partial dependencies
– Multiple determinants
– Overlapping determinants
– Exploring the determinant of ‘fee’ further
• Finding keys using functional dependency
• Normalisation
– Un-normalised data
∗ Problems with un-normalised data
– First normal form
∗ Determinacy diagram for first normal form
∗ Insertion anomalies of first normal form
∗ Arbitrary selection of a primary key for relation in 1NF
∗ Amendment anomalies of first normal form
∗ Deletion anomalies of first normal form
– Second normal form
∗ Insertion anomalies of second normal form
∗ Amendment anomalies of second normal form
∗ Deletion anomalies of second normal form
– Third normal form
∗ Summary of the first three normal forms
• Review questions
• Discussion topic
• Application and further work
Objectives
• Describe and apply understanding of three normal forms for relations.
• Convert un-normalised data into first normal form relations, so that data
items contain only single, simple values.
• Derive second normal form relations by eliminating part-key dependencies.
• Derive third normal form relations by removing transitive dependencies.
Introduction
In parallel with this chapter, you should read Chapter 13 of Thomas Connolly and Carolyn Begg, “Database Systems A Practical Approach to Design, Implementation, and Management”, (5th edn.).
Normalisation stands on its own as a well-founded approach to database design.
In addition, normalisation links closely with the material covered in the
previous two chapters on entity-relationship modelling. However, the additional
flexibility of normalised designs comes at a price: a well-normalised design
tends to perform poorly when subjected to large volumes of transactions. For
this reason, there are trade-offs to be made between the extent to which a
design is normalised and the performance of the implemented system. The
information in this chapter has to be applied carefully, in light of the
information given in a later chapter on database design relating to
de-normalisation and physical design.
Why should we attempt to normalise data? Un-normalised data often contains
undesirable redundancy (and its associated ‘costs’ in storage, time and multiple
updates), and different degrees of normalisation (i.e. different normal forms) can
guarantee that certain creation, update and deletion anomalies can be avoided.
Context
This chapter covers the well-known approach to database design known as data
normalisation. It introduces a bottom-up technique for the development of
flexible database applications. This bottom-up approach complements the
top-down entity-relationship technique presented in the first database design
chapter, as the two approaches can be used to cross-check the extent to which
the overall design satisfies the requirements of the application. By
themselves, database designs arrived at through the normalisation process,
while providing great flexibility, tend to perform slowly. The complementary
bottom-up and top-down methodologies, in practice, often reveal different
information, and can be applied using different fact-finding techniques. For
these reasons (of efficiency, and the benefits of multiple viewpoints in
arriving at a better final design), a balanced approach to database design will
use both approaches.
Determinacy diagrams
It might be the case that there are performers who share the same family name
(for example, a family of actors). Each member of the family who is an actor
will have a unique performer-id (as the attribute performer-id is the primary
key), but there may be more than one person with that particular name. The
performer-name would not make a suitable choice for primary key for this reason.
The performer-id uniquely determines the performer-name, but a performer-
name may indicate more than one performer-id.
In a similar way, there may be more than one performer of a particular type;
the performer-id will identify the performer-type for that specific individual. It
is likely that any one location may have more than one performer based there;
the location of any particular performer can be determined by means of the
performer-id as the primary key. There are several possibilities for considering
how the fee to a performer for a booking at a venue might be calculated, and
these might include:
• flat rate fee for all performers for all venues
• fee negotiated with performer
• fee depends on performer’s location
• fee depends on location of venue
• fee depends on performer type
• fee depends on date of booking
• fee depends on a combination of factors (e.g. performer and agent)
The method by which the fee is calculated will affect the way the data is mod-
elled; this is because the value of the fee can be linked to a number of other at-
tributes, and might not be determined by the performer-id alone as the primary
key. The determinacy diagrams may be different depending on the particular
method of calculating the fee.
If we consider some of the possibilities outlined above, we can identify the de-
pendencies that affect the fee and create a determinacy diagram.
Direct dependencies
An example to illustrate direct dependencies might be: flat rate fee for all
performers for all venues.
In this case, the fee could be regarded as another attribute of each performer,
or could be linked to a performance (the number of performances determining
the total amount earned). The fee could be regarded as an entity in its own
right. We would need to take into account what would happen if the fees were to
change. Would all fees change to the same new value? What would determine
whether one performer earned a different fee from another? The answers to
these questions would reveal underlying dependencies between the data.
If we assume that all performers are paid the same fee, and when the fee is
changed it affects all performers in exactly the same way, we can identify the
fee as a separate entity.
The value of the fee would then depend on the fee code. The fee is directly
dependent on the fee code.
(Note that we would not want to insert the exact fee as a value for all performers
because of the implications of updating the value when the fee changes.)
On the other hand, if the fee applies to all venues in the same area, venues must
be identified as belonging to specific areas in which a given fee applies. This is
an indirect dependency, also known as a transitive dependency.
Important
Transitive (indirect) dependency
Sometimes the value of an attribute is not determined directly from the primary
key, but through the value of another attribute which is determined by the
primary key; this relationship is known as a transitive dependency.
Another example of a transitive dependency
Consider the following attributes: fee depends on performer type.
Here the fee depends on whether the performer is an actor, dancer, singer or
some other type of performer. The different types of performer need to be
identified, and a fee specified in each case. The value of the fee does not depend
directly on the performer-id, but is linked to the type of performer. This is
another example of an indirect (or transitive) dependency.
Sometimes the determinant is not a single attribute, but is made up of two or
more attributes. Consider the following: fee depends on a combination of
factors (e.g. performer and agent).
Important
Composite determinant
If more than one value is required to determine the value of another attribute,
the combination of values is known as a composite determinant.
If the fee is determined by more than one factor, both these elements must be
taken into account. This is shown in the determinacy diagram on the right by
the arrow including both the performer-id and the agent-id as the determinant
items on which the fee depends. Together, the attributes performer-id and
agent-id form a composite determinant.
Where every attribute of a composite key is required to determine another
attribute, that attribute is said to be fully functionally dependent on the
key.
Note that the attributes that depend only on performer-id (such as the name,
type and location of each performer), or agent-id (such as the agent and location
of each agent) are shown linked directly to the appropriate key. If we take
performer-id and agent-id as the key, we can say that the performer and agent
details are partially dependent on the key. Partial dependency is when an
attribute is functionally dependent on a proper subset of the key.
Important
Partial dependency
If the value of an attribute does not depend on an entire composite determinant,
but only part of it, that relationship is known as a partial dependency.
Multiple determinants
It is possible that there may be more than one attribute that can act as a
determinant for other attributes. This is a slightly different situation from that
of candidate keys, as not all determinants are necessarily candidate keys. If
we wish to describe an event, we may find that there is a special relationship
between the attributes event-id and event-name; each event will have a unique
identification number, and also a unique name. The relationship between the
event-id and the event-name is one-to-one. The better choice of primary key for
the event would be event-id, which is a unique identification number.
The attribute event-name, while unique to each event, would not make such a
good choice as the key because there can be problems in obtaining an exact
match (e.g. “Quicktime”, “Quick time” and “Quick Time” would be regarded
as different names).
We can show dependencies between the attributes event-id, event-name and
event-type on a determinacy diagram.
Each event would have values for the attributes event-id, event-name and event-
type.
In the determinacy diagram below, we can see that event-id is a determinant
for the other two attributes, event-name and event-type.
The determinacy diagram shows that the attribute event-name is also a deter-
minant for the other two attributes, event-id and event-type. This is because
there is a one-to-one relationship between event-id and event-name.
Overlapping determinants
There are sometimes cases where there is more than one combination of at-
tributes that uniquely identifies a particular record. This means that the de-
terminants have attributes in common. In certain circumstances, there may be
a special relationship between the attributes, so that each uniquely determines
the value of the other.
An example of this may be where each module in a degree programme has a
unique module code and a unique module name. It would be possible to use
either the module code or the module name as the determinant. In addition, the
module code determines the module name, and the module name determines
the module code.
In the context of our example relating to performers, agents, venues and events,
we will also need to be able to identify bookings. We find that each booking
can be identified by a different combination of attributes.
When a booking is made, the performer-id, agent-id, venue-id and event-id are
all required in order to specify a particular event occurring on a given date. This
also needs to be represented using a determinacy diagram.
Each booking can be identified by the primary key, which is shown on the right
as a combination of the attributes performer-id, agent-id, venue-id and event-id.
Note that in this instance, the arrow (coming from the outer box) indicates that
all four key attributes are used to identify the booking date.
We know that each event can be identified either by the event-id or the event-
name; this means that we could have an alternative representation in the deter-
minacy diagram, substituting the attribute event-name for event-id as part of
the combined key.
An alternative primary key for each booking would be a combination of
performer-id, agent-id, venue-id and event-name.
Here again, the arrow (coming from the outer box) indicates that all four key
attributes are used to identify the booking date.
Here we have an overlapping key. The attribute event-name is a determinant,
although it is not a candidate key for its own data. We would not want to use
the event-name as a primary key, as it can present a problem in identifying the
relevant tuple if the spelling is not exactly the same as in the relation.
The determinacy diagram also shows the relationship between the attributes
event-id and event-name.
Exploring the determinant of ‘fee’ further
If a performer negotiates the same fee for all bookings, the fee depends on
the performer-id, as each performer will have their own fee. This is a direct
dependency.
Where the value of the fee depends on the date of the booking, the value of the fee
cannot be known until details of the booking are available.
This means that a booking must be made by an agent for a performer at a venue
in order for the fee to be determined. It may be that a higher fee is paid in the
summer months than at other times of the year.
The booking date will be determined by the composite determinant made up
from the agent-id, performer-id and venue-id (as all three are involved). The
booking date itself then determines the fee. There is therefore an indirect (or
transitive) dependency between the composite key and the fee.
Functional dependencies (FDs) help us to find keys for relations. To identify
all candidate keys, check whether each determinant uniquely identifies tuples
in the relation. Let's define another important concept, called attribute
closure.
Attribute closure
The closure of X, written X+, is all the attributes functionally determined by X.
That is, X+ gives all the values that follow uniquely from X. Attribute closure
is used to find keys and to see if a functional dependency is true or false.
To find the closure X+, follow these steps:
• ans = X
• For every FD Y → Z such that Y ⊆ ans, add Z to ans
• Repeat until no more changes to ans are possible
For example, given a relation R, such that
R(S, C, P, M, G, L, T)
FDs: {SC → PMG, SL → C, CT → L, TL → C, SP → C}
Can we answer the following two questions?
Is SL a key for R?
• Start with ans = {SL}
• Using 2nd FD, SL functionally determines C, so we add C to the ans, ans
= {SLC}
• Using 1st FD, SC functionally determines PMG, so we add PMG to the
ans, ans = {SLCPMG}
• No more attributes can be added because no subset of the ans functionally
determines other attributes, so (SL)+ is SLCPMG
Is SL a key for R? No, because the closure of SL is not equal to the set of all
attributes in R (the attribute T is missing).
Does SL → PG?
Yes, because PG is in (SL)+
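The closure algorithm above is easy to mechanise; here is a short sketch in Python that reproduces the worked example, with each FD written as a (left-hand side, right-hand side) pair:

```python
def closure(attrs, fds):
    """Return X+ : all attributes functionally determined by attrs."""
    ans = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If Y is a subset of ans, every attribute of Z follows from ans.
            if set(lhs) <= ans and not set(rhs) <= ans:
                ans |= set(rhs)
                changed = True
    return ans

# FDs from the example: SC->PMG, SL->C, CT->L, TL->C, SP->C
fds = [("SC", "PMG"), ("SL", "C"), ("CT", "L"), ("TL", "C"), ("SP", "C")]
print("".join(sorted(closure("SL", fds))))   # CGLMPS
print(closure("SL", fds) == set("SCPMGLT"))  # False: SL is not a key (T is missing)
print(set("PG") <= closure("SL", fds))       # True: SL -> PG holds
```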
Normalisation
If we consider the data before it has undergone the normalisation process, we
regard it as un-normalised.
Un-normalised data
To accommodate the size of the table, some headings have been shortened as
shown below:
• P-id: performer-id
• Perf-name: performer-name
• Perf-type: performer-type
• Perf-Loc’n: performer-location
• A-id: agent-id
• Agent-Loc’n: agent-location
• V-id: venue-id
• Venue-Loc’n: venue-location
• E-id: event-id
The performer Eagles (performer-id 112) has bookings made by more than one
agent, and therefore there are multiple entries for agent details, rather than a
single entry.
First normal form
Agent details
The information about each agent depends on the agent-id as the primary key.
Note that the arrow from agent-id indicates that the agent attributes depend
only on agent-id as the key attribute, and not performer-id, venue-id or event-id.
Venue details
The primary key, venue-id, determines the name and location of each venue.
Note that the venue-name depends only on the venue-id as shown by the ar-
row in the diagram. The attributes performer-id, agent-id and event-id do not
determine the venue-name.
Event details
We can consider the representation of events from two angles. We have two
attributes which can be used as determinants: event-id and event-name. We
can examine each in turn using a determinacy diagram, and then show the
relationships between all three attributes (event-id, event-name and event-type)
on a single determinacy diagram.
Event-id as the determinant
The primary key, event-id, determines the name and type of each event. There
is a one-to-one relationship between event-id and event-name; either could be
used to identify the other.
Note that the event-name depends only on the event-id as shown by the arrow
in the diagram. The attributes performer-id, agent-id and venue-id do not
determine the event-name.
Event-name as the determinant
There is a special relationship between the attributes event-id and event-name;
each event-id and each event-name is unique.
This means that we could use either the event-id or the event-name as the
determinant for locating details about an event.
The determinacy diagram below shows the event-name being used as the deter-
minant, although we would not want to use it as the primary key, as names can
be difficult to get exactly right.
Event-id and event-name as determinants
We can show the special relationship between event-id and event-name by arrows
illustrating the link in each direction.
As either event-id or event-name can determine the event-type, there are links
between event-id and event-type, and also between event-name and event-type.
Booking detail
In addition to the performers, agents and venues, we need to be able to identify
the bookings that have been made. When a booking is made, the performer-id,
agent-id, venue-id and event-id are all required in order to specify a particular
event occurring on a given date. This also needs to be represented using a
determinacy diagram.
Each booking can be identified by the primary key, which is shown on the right
as a combination of the attributes performer-id, agent-id, venue-id and event-id.
Note that in this instance, the arrow (coming from the outer box) indicates that
all four key attributes are used to identify the booking date.
We know that each event can be identified either by the event-id or the event-
name; this means that we could have an alternative representation in the deter-
minacy diagram, substituting the attribute event-name for event-id as part of
the combined key.
An alternative primary key for each booking would be a combination of
performer-id, agent-id, venue-id and event-name.
Here again, the arrow (coming from the outer box) indicates that all four key
attributes are used to identify the booking date.
Here we have an overlapping key. The attribute event-name is a determinant,
although it is not a candidate key for its own data. We would not want to use
the event-name as a primary key, as it can present a problem in identifying the
relevant tuple if the spelling is not exactly the same as in the relation.
We can show the overlapping nature of the keys for the booking details in a
determinacy diagram.
The determinacy diagram below shows that the booking date could be located
through a primary key constructed from the attributes performer-id, agent-id,
venue-id and event-id, or by means of a primary key combining the attributes
performer-id, agent-id, venue-id and event-name.
The determinacy diagram also shows the relationship between the attributes
event-id and event-name.
It is not common to find overlapping keys; it is more usual to have a unique iden-
tifier which distinguishes between different items (for example, the performer-id
will distinguish between different performers who may happen to have the same
name). At this point in the normalisation process, overlapping keys do not
present a problem, but they will be dealt with at a later stage. We will use the
event-id in preference to the event-name for the time being, but we will need to
remember the special relationship that exists between these two attributes.
The combined determinacy diagram (above) for first normal form shows that:
• The performer attributes (name, type, location and fee) depend only on
the key performer-id.
• The agent attributes (name and location) depend only on the key agent-id.
• The venue attributes (name and location) depend only on the key venue-
id.
• The event attributes (name and type) depend only on the key event-id
(we will examine the relationship between event-id and event-name later).
• The booking details depend on all four key attributes: performer-id,
agent-id, venue-id and event-id.
The full determinacy diagram for first normal form, showing the overlapping
keys, is shown below:
The result of converting an un-normalised table of data into first normal form is
to remove repeating values, so that each line in the table has the same format,
with only one value in each column for each row. This means that there will
be only one value for each attribute for each tuple in a relation in first normal
form.
Where more than one booking has been made for a performer, each booking is
now given as a separate entry.
The original table of data has been converted into a relation in first normal
form, as shown below. The relation has the same structure as the determinacy
diagram, both being in first normal form, and exhibiting the following charac-
teristics:
• All performers have a performer-id as the primary key.
• Details about agents can be determined from the primary key agent-id.
• Any venue can be identified by the venue-id as the primary key.
• All events can be determined by event-id as the primary key.
• Where a booking has been made, the key attributes performer-id, agent-
id, venue-id and event-id all have values, which combine to identify each
particular booking as a composite (or compound) primary key.
We can now convert our table of un-normalised data into a relation in first
normal form (1NF). Note that there is at most a single value at the intersection
of each row and column. This process is sometimes known as ‘flattening’ the
table.
Table of relation in first normal form (1NF)
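The flattening step can be sketched in a few lines of Python. This is a hedged illustration: the `flatten` helper and the field values in the record below are assumptions, not data taken from the case-study tables.

```python
def flatten(record, group="bookings"):
    """Yield one flat 1NF row per entry in the repeating group."""
    fixed = {k: v for k, v in record.items() if k != group}
    for entry in record[group]:
        yield {**fixed, **entry}  # exactly one value per column per row

unnormalised = {
    "performer-id": 143,
    "performer-name": "Tan",
    "bookings": [  # repeating group: several bookings for one performer
        {"agent-id": 1478, "venue-id": 79, "event-id": 926, "date": "2024-01-05"},
        {"agent-id": 1504, "venue-id": 62, "event-id": 952, "date": "2024-02-11"},
    ],
}

rows_1nf = list(flatten(unnormalised))  # two rows; performer details repeated in each
```

Each resulting row has a single value at every row/column intersection, at the cost of repeating the performer details once per booking.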
We can see that the relation in first normal form will still exhibit some problems
when we try to insert new tuples, update existing values or delete existing tuples.
This is because there is no primary key for the whole table, although each major
component has its own key (performer, agent, venue, event and booking).
• Agent-id
• Venue-id
• Event-id
• Performer-id, agent-id, venue-id and event-id combined
Would the attribute performer-id make a suitable key for the relation in 1NF?
The attribute performer-id is the primary key for performers, but it cannot be
used as the key for the whole relation in first normal form as there are some
cases where there is no relevant value, as shown in the following examples:
No performer-id for Shaw
The venue Shaw (venue-id 62) has not been used for any bookings, and therefore
has no performer-id associated with it that could be used as a key.
We can conclude that the attribute performer-id would not make a suitable key
for the relation in first normal form.
Would the attribute agent-id make a suitable key for the relation in 1NF?
The attribute agent-id is the primary key for agents, but it cannot be used as
the key for the whole relation in first normal form, as there are cases where
no value is available, as shown in the following examples:
No agent-id for Tan
The actor Tan (performer-id 149) has no bookings and therefore no agent-id is
available to be used as a key.
Note that the performer-id as primary key for performers distinguishes between
149 Tan the actor, and 143 Tan the singer (who does have a booking).
No agent-id for New Dawn
There are no bookings for the event New Dawn (event-id 938), and therefore
there is no agent-id that could be used as a key.
We can conclude that the attribute agent-id would not make a suitable key for
the relation in first normal form.
Would the attribute venue-id make a suitable key for the relation in 1NF?
The attribute venue-id is the primary key for all venues, but it cannot be em-
ployed as the key for the whole relation in first normal form as there are instances
where no value has been allocated, for example:
No venue-id for Tan
The actor Tan (performer-id 149) has no bookings at a venue and therefore
there is no venue-id that can be used as a key.
No venue-id for Webb
The agent Webb (agent-id 1377) has made no bookings, and is therefore not
associated with any venue-id that could be used as a key.
We can conclude that the attribute venue-id would not make a suitable key for
the relation in first normal form.
Would the attribute event-id make a suitable key for the relation in 1NF?
The attribute event-id is the primary key for events (although the event-name
could also be used as the primary key). The examples below demonstrate that
the event-id cannot be used as the key for the whole relation in first normal
form, as there are cases where there is no value for the event-id.
No event-id for Tan
The actor Tan (performer-id 149) has no bookings at an event and therefore
there is no event-id that can be used as a key.
No event-id for Shaw
The venue Shaw (venue-id 62) has not been used for any bookings, and therefore
there is no event-id associated with it that could be used as a key.
We can conclude that the attribute event-id would not make a suitable key for
the relation in first normal form.
Would the combined attributes performer-id, agent-id, venue-id and event-id
make a suitable key for the relation in 1NF?
The combined attributes performer-id, agent-id, venue-id and event-id serve as
the primary key for all bookings, but this combination cannot be employed as
the key for the whole relation in first normal form as there are entries where the
key would be incomplete, for example:
No agent-id, venue-id or event-id for Tan
The actor Tan (performer-id 149) has no bookings made by an agent at a venue
for an event and therefore there is no complete combined key value.
We can conclude that the combination of the attributes performer-id, agent-id,
venue-id and event-id would not make a suitable key for the relation in first
normal form.
There is no obvious choice for a primary key. The attributes that we might
expect to be able to use as a key (such as performer-id, agent-id, venue-id
and event-id) are unsuitable because a value is not always available, and it is
not possible to have a key field with a null (or empty) value (because of the
requirements of entity integrity).
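The entity-integrity argument can be made concrete. In this sketch the rows are simplified stand-ins for the Tan and Shaw examples, and the `usable_as_key` helper is an assumption rather than case-study code; a candidate key is rejected if any row has a null in a key attribute or if key values repeat.

```python
def usable_as_key(rows, attrs):
    """A key must contain no nulls (entity integrity) and no duplicate values."""
    values = []
    for row in rows:
        value = tuple(row[a] for a in attrs)
        if any(v is None for v in value):
            return False  # a key attribute is missing for this row
        values.append(value)
    return len(values) == len(set(values))  # uniqueness across all rows

rows_1nf = [
    {"performer-id": 149, "agent-id": None, "venue-id": None, "event-id": None},  # Tan: no bookings
    {"performer-id": None, "agent-id": None, "venue-id": 62, "event-id": None},   # Shaw: never booked
    {"performer-id": 143, "agent-id": 1478, "venue-id": 79, "event-id": 926},     # a real booking
]

print(usable_as_key(rows_1nf, ["performer-id"]))  # False: the Shaw row has no performer
print(usable_as_key(rows_1nf, ["performer-id", "agent-id", "venue-id", "event-id"]))  # False
```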
We would not be able to insert details about new venues that have not yet been
used for a booking, as they too will not have a performer-id associated with
them. In this instance, it would not be possible to retain the tuple on the venue
Shaw (venue-id 62).
We would not be able to insert details about new events that had not yet been
booked, as any such event will not have a performer-id associated with it. This
means that we would not be able to retain the tuple on the event New Dawn,
as it has not been used for a booking.
Problems if agent changes location
The agent Lee (agent-id 1504) has made bookings for more than one performer,
at more than one location, so if agent Lee were to move to another location
it would be necessary to change details of the agent location in more than one
place.
in the relation in first normal form. If the drama 952 Gold Days were to be
rewritten to include songs, it would then need to be reclassified as a musical,
and this information would need to be updated for every booking for that event.
Even if the new musical production were allocated a new event-id, the change
would still need to be reflected in every booking of the event.
It is worth noting that if the details for Eagles are removed from the relation, all
three occurrences would have to be removed; there would be problems of data
integrity and consistency if some were omitted.
not cause a loss of data about performers, agents or venues.
The other booking for event 926 New Year for performer Chong was made by
agent 1478 Burns at venue 79 Festive. The agent Burns and the venue Festive
are also involved in other bookings, but this was the only booking for performer
Chong. If this tuple is deleted, we will lose all details concerning the performer
129 Chong.
The process of converting a relation from first normal form into second normal
form is the identification of the primary keys, and the grouping together of
attributes that relate to the key. This means that attributes that depend on dif-
ferent keys will now appear in a separate relation, where each attribute depends
only on the key, whether directly or indirectly. The purpose of converting the
relation into second normal form is to resolve many of the problems identified
with first normal form.
Important
Second normal form (2NF)
For a relation to be in second normal form, all attributes must be fully func-
tionally dependent on the primary key. Data items which are only partial
dependencies (as they are not fully functionally dependent on the primary key)
need to be extracted to form new relations.
For our performer case study, the single relation in first normal form (1NF) is
transformed into five relations in second normal form (working from the 1NF
determinacy diagram): Performers, Agents, Venues, Events and Bookings.
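The 1NF-to-2NF step amounts to projecting each group of fully dependent attributes, together with its key, into a relation of its own and discarding duplicate tuples. The Python sketch below is a hedged illustration: the attribute groupings follow the case study, but the `project` helper and the single sample row are assumptions.

```python
def project(rows, attrs):
    """Project rows onto `attrs`, skipping incomplete rows and duplicates."""
    seen, result = set(), []
    for row in rows:
        value = tuple((a, row[a]) for a in attrs)
        if any(v is None for _, v in value):
            continue  # e.g. a venue with no bookings contributes no Bookings tuple
        if value not in seen:
            seen.add(value)
            result.append(dict(value))
    return result

rows_1nf = [
    {"performer-id": 143, "performer-name": "Tan",
     "agent-id": 1478, "agent-name": "Burns",
     "venue-id": 79, "venue-name": "Festive",
     "event-id": 926, "event-name": "New Year", "date": "2024-01-05"},
]

performers = project(rows_1nf, ["performer-id", "performer-name"])
agents = project(rows_1nf, ["agent-id", "agent-name"])
bookings = project(rows_1nf, ["performer-id", "agent-id", "venue-id", "event-id", "date"])
```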
Performer details
All data relating to performers is now grouped separately from agents, venues,
events and bookings. The determinacy diagram for performer details gives us a
performer relation in second normal form. The primary key for the performer
relation is performer-id, and the other attributes are name, performer-type, fee
and location.
The creation of an independent new relation for performers has the following
benefits, which resolve the problems encountered with the single relation in first
normal form:
• New performers can be inserted even if they have no bookings.
• A single amendment will be sufficient to update performer details even if
several bookings are involved.
• The deletion of a performer record will not result in the loss of details
concerning agents, venues or events, as performers, agents, venues and
events are now stored independently of each other.
Agent details
The information concerning agents is now stored separately from that of per-
formers, venues and bookings. The determinacy diagram for agents gives us
a relation for agents in second normal form. The primary key for the agents
relation is agent-id, and the remaining attributes are name and location.
The new relation for agents has the following benefits, which resolve the prob-
lems encountered with the single relation in first normal form because the new
relation is independent from performers, venues and bookings:
• New agents can be inserted even if they have made no bookings.
• A single change will be enough to update agent details, even if several
bookings are involved.
• Agent details will now no longer be lost if a performer is deleted, as per-
formers, agents, venues and events are now stored independently of each
other.
Venue details
The creation of a new relation solely to store the details of venues has the
following effects, which resolve the problems identified with the single relation
in first normal form:
• Details of a new venue can be inserted, whether or not it has been booked.
• If the name of the venue is changed, the alteration only needs to be made
once, in the venue relation, not for every booking of that venue.
• If details of a performer are deleted, and the performer had the only book-
ing at a particular venue, details of the venue will not be lost.
Event details
• A new relation is created to hold details of individual events.
• Details of a new event can be inserted, whether or not it has been booked.
• If the name of the event is changed, the alteration only needs to be made
once, in the event relation, not for every booking of that event.
• If details of a performer are deleted, and the performer had the only book-
ing of a particular event, details of the event will not be lost.
The determinacy diagram could be represented as follows:
Relation in second normal form: Events
Booking details
Every time a booking is made, the details are recorded in the relation called
Bookings. There is no need to store all the details of the performer, agent,
venue and event for each booking that is made, as this information can be
acquired from the relevant relation for performers, agents, venues and events.
The information that is needed for the Booking relation is the performer-id,
agent-id, venue-id and event-id (these four attributes together form the key for
this relation), and the booking date.
Another possible key for the Booking relation involves the attributes performer-
id, agent-id, venue-id and event-name; as three of the four attributes in this key
are the same as the first key described for this relation, we have an example
of overlapping keys. Note that the overlapping keys are not resolved in the
transformation from first normal form to second normal form, as event-id and
event-name are part of each key. Conversion from first to second normal form
extracts all non-key attributes which are only partially dependent on the key,
and as such event-id and event-name remain as they are part of the key.
The determinacy diagram below shows the overlapping keys for the Bookings
relation, and also illustrates the dependencies between the attributes event-id
and event-name:
Relation in second normal form: Bookings
Amendment anomalies of second normal form
If performer Stokes (performer-id 126), who is the only comedian in the rela-
tion, retrains and changes career to become a magician, we will then lose the
information that comedians are paid a fee of 90 (in whatever currency is used).
Stokes will then be paid 72, which is the fee for all magicians.
We would also find an amendment anomaly if the fee paid to a particular type
of performer changed. If all singers were granted a new rate, all tuples relat-
ing to singers would need to be updated, otherwise the data would become
inconsistent.
All these anomalies are caused by the fee paid to the performer being dependent
on the performer-type, and not directly on the primary key performer-id. This
indirect, or transitive, dependency can be resolved by transforming the relations
in second normal form into third normal form, by extracting the attributes
involved in the indirect dependency into a separate new relation.
Third normal form
The reason for converting a table from second normal form into third normal
form is to ensure that data depends directly on the primary key, and not through
some other relationship with another attribute (known as an indirect, or transi-
tive, dependency).
Important
Third normal form (3NF)
A relation is in third normal form if there are no indirect (or transitive) depen-
dencies between the attributes; for a relation to be in third normal form, all
attributes must be directly dependent on the primary key.
An indirect dependency is resolved by creating a new relation for each entity;
these new relations contain the transitively dependent attributes together with
the primary key.
The conversion of a relation into third normal form will resolve anomalies iden-
tified in second normal form.
We now have six relations in third normal form: Performers, Fees, Agents,
Venues, Events and Bookings.
Performer details
As before, the name and location of each performer depends on the performer-
id. We noticed in second normal form that there were problems associated
with having the fee contained within the performer relation, as the value of the
fee depended on the performer-type and not on performer-id, demonstrating a
transitive dependency.
One solution would be to create a new relation with performer-type as the key,
and the fee as the other attribute; performer-type would also remain in the
relation Performers, but the fee would be removed.
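The extraction of the Fees relation can be sketched as follows. Only the comedian fee of 90 comes from the text; the other fee figures and the performer rows are made-up illustrations.

```python
performers_2nf = [
    {"performer-id": 126, "name": "Stokes", "performer-type": "comedian", "fee": 90},
    {"performer-id": 143, "name": "Tan", "performer-type": "singer", "fee": 85},
    {"performer-id": 149, "name": "Tan", "performer-type": "actor", "fee": 80},
]

# Fees: performer-type -> fee, with each fee now stored once per type.
fees = {row["performer-type"]: row["fee"] for row in performers_2nf}

# Performers in 3NF keep the type but no longer carry the fee directly.
performers_3nf = [{k: v for k, v in row.items() if k != "fee"}
                  for row in performers_2nf]

# Raising the singers' rate is now a single update, not one per singer:
fees["singer"] = 95
```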
The relations for Performer and Fees follow the determinacy diagrams below.
Relation in third normal form: Performers
A possible problem with this approach is the format of data entry of new per-
formers; if “ACROBAT”, “Acrobat” or “acrobat” are entered, they might not
be recognised as the same performer-type. In addition, if an error is made and
“arcobat” is entered, this may not be recognised. To deal with this problem, we
have used a code for performer-type in the Performer relation. This code is then
used as the key in the Fees relation, and the other attributes are performer-type
and the fee, both of which depend on the performer-code as primary key. (We
could have introduced the performer-code into the table of un-normalised data.)
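The benefit of the surrogate code can be seen in a short sketch. The codes, the acrobat fee and the performer row below are assumptions made for illustration; only the magician fee of 72 comes from the text.

```python
# Fees keyed by a fixed code, so free-text spelling never enters a join.
fees = {
    "AC": {"performer-type": "acrobat", "fee": 65},
    "MA": {"performer-type": "magician", "fee": 72},
}

performer = {"performer-id": 201, "name": "Silva", "performer-code": "AC"}

# An exact-match lookup on the code: no risk of "Acrobat" vs "ACROBAT"
# (or the misspelling "arcobat") failing to match.
fee = fees[performer["performer-code"]]["fee"]  # 65
```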
The relations for Performer and Fees follow the determinacy diagrams below.
Relation in third normal form: Fees
Agent details
There is no change to the determinacy diagram for Agents, as this is already in
third normal form (there were no transitive dependencies). The relation follows
the determinacy diagram below.
Relation in third normal form: Agents
Venue details
The data on Venues is already in third normal form as there were no transitive
dependencies; there are therefore no changes to the determinacy diagram shown
below, and the relation which follows.
Relation in third normal form: Venues
Event details
The Events relation is already in third normal form as there are no transitive de-
pendencies. There is the special relationship that exists between the attributes
event-id and event-name, which does not present a problem within the Events
relation itself, but creates difficulties in the Bookings relation because of the
overlapping key which results.
Relation in third normal form: Events
Booking details
The relation Bookings, with its composite determinants of performer-id, agent-
id, venue-id and event-id, or performer-id, agent-id, venue-id and event-name,
is already in third normal form as there are no transitive dependencies. The
determinacy diagrams and the associated relation are illustrated below.
This determinacy diagram illustrates the combination of performer-id, agent-id,
venue-id and event-id used as the determinant for the Bookings relation:
The next determinacy diagram shows the choice of performer-id, agent-id, venue-
id and event-name as the determinants for the Bookings relation.
The determinacy diagram below combines the previous two determinacy dia-
grams to show the overlapping keys for the Bookings relation, and illustrates
the dependencies between the attributes event-id and event-name.
The details of the Bookings relation are shown below.
Relation in third normal form: Bookings
• Identify any attributes that are not directly determined by the key (this
is sometimes called ‘removing transitive dependencies’).
We shall see in a later chapter on database design that there is further work
that can be done to normalise sets of relations, and alternative approaches
to reaching third normal form (3NF). However, 3NF represents a point where
we have gained a significant degree of flexibility in the design of a database
application, and it is a point at which normalisation of many applications is
considered to be complete.
Review questions
1. How is each project identified?
Each project is to be allocated a unique project number. This number
need only be two digits long, as there will never be more than 99 projects
to be stored at one time.
2. Is it required to store both expected and actual start and finish dates for
projects?
Yes, all four data items are required, and the same four data items are
required for tasks as well.
3. How are tasks identified?
They also have a unique task number, which again can be safely limited
to two digits. So each task is identified by the combination of the project
number of the project within which it occurs, and its own task number.
4. Do projects have many tasks?
Yes, each project typically consists of about 10 tasks.
5. Can a task be split between more than one project?
No, a task is always considered to take place within one project only.
6. Are employees assigned to projects, or to specific tasks within projects?
How many tasks can an employee work on at one time?
Employees are assigned to specific tasks within projects, so each employee
can work on a number of tasks at one time. Furthermore, each task has
an employee allocated to it who is specifically responsible for its successful
completion. Each project has a project leader responsible for that project’s
successful completion.
7. What is required to be stored about contract staff pay?
Each staff member is paid at a monthly rate, that rate being determined
entirely by the highest qualification held by the staff member. We simply
need to record the appropriate qualification for each staff member, and
the monthly rate at which they are paid, plus the start and end dates of
their current contract.
Review question 3
Remove any part-key dependencies from the relation produced in question 2 to
produce a set of second normal form relations.
Review question 4
From the second normal form design in the previous question, produce a set of
third normal form relations, by removing any indirect or transitive dependencies.
Review question 5
Explain the role of determinacy diagrams in database application development.
Review question 6
What is a repeating group? Why is it necessary to remove repeating groups in
Relational database design?
Review question 7
Explain the term ‘part-key dependency’, and its role in normalisation.
Review question 8
What is the difference between second and third normal form relations?
Discussion topic
As mentioned at the start of the review questions, the process of eliciting in-
formation about the requirements of computer applications is an extremely im-
portant and potentially difficult one. Among the techniques that are commonly
used to capture the requirements of users and other stakeholders in the system
are:
• Interviews, which vary in different organisations and between individuals
in the amount of planning and pre-determined questions
• Questionnaire surveys, in the following formats: written, e-mail or web-
based
• Brainstorming
• Direct observation of users carrying out tasks
All of these techniques and more can play a useful role in capturing requirements,
and each technique has particular strengths and weaknesses. You are encouraged
to discuss with other students the strengths and weaknesses you consider each
of the techniques listed above have in obtaining accurate and comprehensive
information about the requirements for a new computer application. You should
include in the discussion any experiences you have had yourself of good or bad
practice in the process of requirements capture.
You are encouraged to consider the strengths and weaknesses of the application
developed in the review questions.
Firstly, identify the additional flexibility gained by each successive stage of the
normalisation process. That is, clarify the sorts of data manipulation that can
be carried out in the more normalised versions of the design, compared to the
un-normalised design.
Secondly, consider to what extent this extra flexibility is likely to be useful
to the business owner, and whether it is worth the overhead of managing the
additional tables.
Chapter 9. Advanced Data Normalisation
Table of contents
• Objectives
• Context
• Recap
– Introduction
– Before starting work on this chapter
– Summary of the first three normal forms
– Third normal form determinacy diagrams and relations of Performer
∗ Case study
• Motivation for normalising beyond third normal form
– Why go beyond third normal form?
– Insertion anomalies of third normal form
– Amendment anomalies of third normal form
– Deletion anomalies of third normal form
• Boyce-Codd and fourth normal form
– Beyond third normal form
– Boyce-Codd normal form
– Fourth normal form
– Summary of normalisation rules
• Fully normalised relations
• Entity-relationship diagram
• Further issues in decomposing relations
– Resolution of the problem
• Denormalisation and over-normalisation
– Denormalisation
– Over-normalisation
∗ Splitting a table horizontally
∗ Splitting a table vertically
• Review questions
• Discussion topic
Objectives
• Describe how denormalisation can be used to improve the performance
response of a database application.
Context
This chapter relates closely to the previous two on database design. It finalises
the material on normalisation, demonstrates how a fully normalised design can
equally be represented as an entity-relationship model, and addresses the impact
that a target DBMS will have on the design process. The issues relating to
appropriate choices of DBMS-specific parameters, to ensure the efficient running
of the application, relate strongly to the material covered in the chapters on
indexing and physical storage. Information in all three chapters can be used
in order to develop applications which provide satisfactory response times and
make effective use of DBMS resources.
Recap
Introduction
In parallel with this chapter, you should read Chapter 14 of Thomas Connolly
and Carolyn Begg, “Database Systems A Practical Approach to Design, Imple-
mentation, and Management”, (5th edn.).
In this concluding database design unit, we bring together a number of advanced
aspects of database application design. It begins by extending the coverage of
data normalisation in an earlier chapter, describing Boyce-Codd normal form (a
refinement of the original third normal form) and providing a different view of
how to generate a set of relations in third normal form. The chapter then looks
at a number of important issues to be considered when decomposing relations
during the process of normalisation. Finally, the important topic of physical
database design is included, which shows the impact that DBMS-specific pa-
rameters can have in the development of an application. Many of these con-
siderations have a direct impact on both the flexibility and the performance
response of the application.
• Fully functional dependency
• Partial functional dependency
• Direct dependency
• Transitive (indirect) dependency
• Determinant
• Determinacy diagrams
• Normal forms
• Un-normalised form (UNF)
• First normal form (1NF)
• Second normal form (2NF)
• Third normal form (3NF)
In this chapter we shall be extending the work from the previous chapter on the
data for the Performer case study. As a starting point, we shall first present the
determinacy diagrams and the third normal form relations developed for this
case study.
Case study
Determinacy diagram: Performers and Fees
Relation in third normal form: Fees
Determinacy diagram: Venues
Determinacy diagram: Events
Determinacy diagrams: Bookings
The determinacy diagram below combines the previous two determinacy dia-
grams to show the overlapping keys for the Bookings relation, and illustrates
the dependencies between the attributes event-id and event-name:
Motivation for normalising beyond third normal form
As we shall explore in this section, under certain circumstances there are anoma-
lies that can occur for data that meets all the requirements for third normal form.
Once these anomalies were identified and understood, database researchers de-
veloped the further normal forms we shall explore in this chapter.
There are no true insertion anomalies in the Bookings relation in third normal
form; the details about each performer, agent, venue and event are also held in
separate relations specifically for those entities, but there is data redundancy.
Relation in third normal form: Bookings
We can see that there is data redundancy in the Bookings relation, as every time
a particular event is involved in a booking, both the event-id and the event-name
need to be inserted into the Bookings relation.
Strictly speaking, we do not need to have both event-id and event-name in the
Bookings relation, as each determines the other. If a mistake were to be made
while inserting a new tuple, so that the event-id and the event-name did not
match, this would cause problems of inconsistency within the database. The
solution is to decide on one of the two determinants from the Events relation as
part of the composite key for the Bookings relation.
We have noted that the event-id and the event-name determine each other within
the Events relation, and this in turn creates overlapping keys in the Bookings
relation. If the relationship between event-id and event-name were to break
down, and a new event happened to have the same name as another event with
a different event-id, this could create problems in the Bookings relation.
Performer Scenario 2
We can refer to a slightly altered database design as ‘Performer Scenario 2’, in
order to demonstrate the effects of overlapping keys.
The issue we need to examine arises in the context of this slightly different
database. In the example we have been using, the Events relation contains
event-id, event-name and event-type. We can see that the performer-type in the
Performers relation matches the event-type in the Events relation (e.g. actors
performing in dramas, singers performing in musicals). Suppose now that the
database holds only event-id and event-name as details about each event; this
changes the structure of the database.
Now, if we were to attempt to insert details about an event which had not yet
been booked, we would not be able to do so as we would have an incomplete
key in the Bookings relation. An event which has not been booked would have
an event-id and an event-name, but no other attributes would have a value as
there has been no booking.
If there were a change to the name of a particular event, this would need to be
reflected in every booking involving that event. Some events may be booked
many times, and if the change to the name of an event is not updated in each
case, we would again find problems with maintaining consistent information in
the database.
Here too, the solution is to identify either event-id or event-name as the deter-
minant from the Events relation, so that the other of these two attributes is
stored once only in the Events relation.
Deletion anomalies of third normal form
In this section we introduce two new normal forms that are more ‘strict’ than
third normal form. For historical reasons, the simple numbering of first, second
and third deviates before getting to fourth. The two new normal forms are
called:
• Boyce-Codd normal form
• Fourth normal form
Important
Boyce-Codd normal form (BCNF)
A relation is in Boyce-Codd normal form if all attributes which are determinants
are also candidate keys.
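The definition can be checked mechanically: compute the closure of each determinant under the relation’s functional dependencies, and verify that it covers the whole relation. The sketch below is illustrative only; the way the Bookings relation and its dependencies are encoded here is our own, not part of the module’s notation.

```python
# Sketch: testing BCNF by checking that every determinant is a candidate key.
# The encoding of the Bookings relation and its functional dependencies is
# illustrative, not part of the module's formal notation.

def closure(attrs, fds):
    """Return the set of attributes functionally determined by `attrs`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_bcnf(relation, fds):
    """A relation is in BCNF if every determinant is a candidate key,
    i.e. the closure of every FD's left-hand side is the whole relation."""
    return all(closure(lhs, fds) == relation for lhs, rhs in fds)

bookings = {"performer-id", "agent-id", "venue-id", "event-id", "event-name",
            "booking-date"}
fds = [
    (frozenset({"performer-id", "agent-id", "venue-id", "event-id"}), bookings),
    (frozenset({"event-id"}), frozenset({"event-name"})),    # determinant, not a key
    (frozenset({"event-name"}), frozenset({"event-id"})),
]

# event-id determines event-name, but event-id alone is not a candidate key
# of Bookings, so the relation fails the BCNF test.
print(is_bcnf(bookings, fds))
```

The same `closure` function can also be used to confirm a transitive dependency: the closure of a single attribute that is only part of a key should never reach beyond what that attribute directly determines.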
Boyce-Codd normal form is stronger than third normal form, and is sometimes
known as strong third normal form.
Transformation into Boyce-Codd normal form deals with the problem of over-
lapping keys.
An indirect (transitive) dependency, by contrast, is resolved by creating a new
relation for each entity; these new relations contain the transitively dependent
attributes together with their determinant, which becomes the primary key of the
new relation.
We know that we can identify a booking by means of the attributes performer-id,
agent-id, venue-id and event-id, as shown in the determinacy diagram below.
We also know that we can identify a booking by using the attributes performer-
id, agent-id, venue-id and event-name, shown in the next determinacy diagram.
When we combine the two determinacy diagrams shown above, we can see that
we have an example of overlapping keys:
The details of the Bookings relation are shown later in this section.
Although overlapping keys are rare in practice (and some examples may appear
rather contrived), we need to be aware that they can occur, and how to deal with
them. The solution is simple: we need to decide on a single choice of attributes
so that we have only one primary key. We know that event-name would not
be an ideal choice of primary key. This is because it can be difficult to get
names exactly right (e.g. “Quicktime” is not identical to “Quick Time”), and
it may be coincidence rather than a rule that there is a one-to-one relationship
between event-id and event-name (the relationship might break down). The
choice of attribute to appear in the primary key is therefore event-id rather
than event-name.
In Boyce-Codd normal form, we have six relations: Performers, Fees, Agents,
Venues, Events and Bookings. The structure of the determinacy diagrams and
content of the relations for Performers, Fees, Agents and Venues remain
unchanged from third normal form, and are not repeated here. There are changes
to the Events and Bookings relations, which are illustrated below. A summary
of the determinacy diagrams and the relations for this example are given in
the ‘Summary of normalisation rules’ section of this chapter, together with an
entity-relationship diagram.
Event details The choice of event-id as part of the primary key for the Bookings
relation means that we can show the simpler representation of the determinacy
diagram for event details, as we no longer have to consider the attribute
event-name as a possible key.
Note that in Scenario 2, where there was no separate Events relation, it would
now be necessary to create an Events relation in order to transform the Bookings
relation from third normal form into Boyce-Codd normal form.
Booking details Now that we have decided that event-id is the more suitable
attribute for use as part of the key for the Bookings relation, we no longer
need to store the event-name, which is already held in the Events relation. The
problem of the overlapping keys has now been resolved, and the key for the
Bookings relation is the combination of the attributes performer-id, agent-id,
venue-id and event-id.
The Bookings relation no longer needs to hold the attribute event-name, as this
is already held in the Events relation.
Relation in Boyce-Codd normal form: Bookings
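The resulting design can be sketched as SQL table declarations, shown here through Python’s built-in sqlite3 module; the column types, and the sample event and booking, are our own assumptions.

```python
import sqlite3

# Sketch of the BCNF Events and Bookings tables; column types and sample
# data are assumed for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("""
    CREATE TABLE events (
        event_id   TEXT PRIMARY KEY,
        event_name TEXT NOT NULL,
        event_type TEXT NOT NULL
    )""")

# event_name no longer appears here: it is held once only, in events.
conn.execute("""
    CREATE TABLE bookings (
        performer_id TEXT NOT NULL,
        agent_id     TEXT NOT NULL,
        venue_id     TEXT NOT NULL,
        event_id     TEXT NOT NULL REFERENCES events(event_id),
        booking_date TEXT,
        PRIMARY KEY (performer_id, agent_id, venue_id, event_id)
    )""")

conn.execute("INSERT INTO events VALUES ('E1', 'Hamlet', 'drama')")
conn.execute("INSERT INTO bookings VALUES ('P1', 'A1', 'V1', 'E1', '2024-06-01')")

# Renaming an event now touches a single row, however many bookings exist.
conn.execute("UPDATE events SET event_name = 'Hamlet (revival)' WHERE event_id = 'E1'")
```

Foreign keys to the Performers, Agents and Venues relations are omitted here for brevity.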
Exercise 1
Define Boyce-Codd normal form.
The normalisation process so far has produced a set of six relations, which are
robust against insertion, amendment and deletion anomalies. If at this stage
it were decided to introduce further details into the relations, it would still be
possible to do so. Database designers and developers would be well advised to
start again with the normalisation process if changes are proposed to the data.
However, for this example, we will introduce some new information that only
affects the performers.
We are now required to add further details to the Performers relation, to show
their knowledge and skills in two other areas: languages and hobbies.
Important
Fourth normal form (4NF) A relation is in fourth normal form if there are
no multi-valued dependencies between the attributes.
Multi-valued dependencies occur where there are a number of attributes that
depend only on the primary key, but exist independently of each other.
The representation of languages spoken and hobbies would be a simple enough
requirement if each performer spoke exactly one language and had only one
hobby. However, our performers are multi-talented, and some speak many
languages, and others have several hobbies. Furthermore, the languages and
hobbies are not related to each other. This presents us with a problem: how
can we represent this in the relation? We know we cannot have a group of items
under the headings Languages and Hobbies, as this would contravene the rules
for first normal form.
The relation below is an attempt at representing some of this information, using
a small number of performers as an example.
Relation: Some-Performers-Example 1
If we look at this relation, while it conforms to the rules for first normal form
(there are no repeating groups), there is still some ambiguity in its meaning. If
we look at Baron’s hobbies, we can see that ‘art’ has been identified, but that
there is no entry for the attribute ‘language’. Does this mean that Baron does
not speak any languages? We know this is not true, because there are
other entries that demonstrate that Baron speaks three languages. If we take
the alternative view, and look at another entry for Baron, we can see that Baron
speaks Italian, but from this entry it could appear that Baron has no hobbies.
This approach is not the solution to the problem.
Another attempted solution pairs languages and hobbies together, but some-
times there is a language but no hobby (or the other way around).
Relation: Some-Performers-Example 2
In this new approach, we have entered each hobby against a language. However,
we are still faced with problems. If Steed decides to give up poetry as a hobby,
we will lose the information that Steed speaks English. If Baron’s French gets
‘rusty’ and is deleted from the relation, we will lose the information that Baron’s
hobby is art.
Relation: Some-Performers-Example 3
In this next attempt, all languages are paired with all hobbies; this means that
there is a great amount of redundancy, as basic data about the performers is
repeated each time. We have a problem with Jones, who does not appear to
have a hobby, which calls into question whether this entry is valid. In addition,
if Steed learns a new language, it would be necessary to repeat this new language
paired with each of Steed’s existing hobbies. This option is also an unsatisfactory
solution.
The solution to this problem is to divide the information that we are trying
to represent into a group of new relations; one containing the basic performer
information as before, another showing details of languages spoken, and a third
maintaining a record of hobbies.
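This decomposition can be sketched with simple sets of tuples; the performer names, languages and hobbies below are invented for illustration.

```python
# Sketch: resolving the multi-valued dependencies by decomposition into three
# relations. The sample data (names, languages, hobbies) is invented.

performers = {("B01", "Baron"), ("S01", "Steed"), ("J01", "Jones")}

# Languages and hobbies are independent multi-valued facts about a performer,
# so each gets its own relation, keyed on the whole tuple.
languages = {("B01", "French"), ("B01", "Italian"), ("B01", "German"),
             ("S01", "English")}
hobbies = {("B01", "art"), ("S01", "poetry"), ("S01", "chess")}

# Dropping a hobby no longer disturbs language information:
hobbies.discard(("S01", "poetry"))
assert ("S01", "English") in languages  # Steed's English survives

# Adding a language needs exactly one new tuple, not one per existing hobby:
languages.add(("S01", "Mandarin"))
```

Compare this with Example 3 above, where adding a language meant repeating it against every one of the performer’s hobbies.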
This transformation deals with the problems of multi-valued facts and associated
redundancies in the data; we can now convert the relation into three relations
in fourth normal form.
Naturally, the new relations would hold data for all the performers, although
only an extract from each relation is given here.
Relation in fourth normal form: Some-Performers
Exercise 2
Define fourth normal form.
The rules used for converting a group of data items which are un-normalised into
a collection of normalised relations are summarised in the table below. Remem-
ber that in some cases we might choose not to normalise the data completely,
if this would lead to inefficiency in the database.
The conversion from fourth normal form to fifth normal form is included for
completeness. We will not be examining the definition of fifth normal form in
detail; it is concerned with avoiding unnecessary duplication of tuples when new
relations are created by joining together existing relations. The cause of this
problem is the existence of interdependent multi-valued data items.
We now have six relations which are fully normalised, and can be represented by
means of determinacy diagrams, relations, and an entity-relationship diagram.
Each relation has a corresponding entity in the entity-relationship diagram.
Performer details
All performers appear in the Performers relation. The primary key is performer-
id, and the other attributes are performer-name, performer-code (which identi-
fies the performer-type) and performer-location.
Fee details
Each performer is paid a fee depending on the performer-type. The rates of
pay for each performer-type are stored in the Fees relation, together with a
performer-code, which is the primary key.
Agent details
All agents are recorded in the Agents relation, where the primary key is agent-id,
and the remaining attributes are agent-name and agent-location.
Venue details
There are a number of venues available for bookings, and these are stored in
the Venues relation. The primary key is venue-id, and the other attributes are
venue-name and venue-location.
Event details
All events which can be booked are listed in the Events relation; the primary
key is event-id, and the other attributes are event-name and event-type.
Booking details
Every booking made by an agent, for a performer, at a venue, for an event, is
stored in the Bookings relation. The primary key is a combination of performer-
id, agent-id, venue-id and event-id; the remaining attribute is booking date.
Note that this assumes there can only be one booking involving a particular
combination of performer, agent, venue and event. This means that we cannot
have multiple bookings made involving the same performer, agent, venue and
event, as the primary key would be the same for each booking and we would
therefore lose unique identification of bookings.
In order to accommodate multiple bookings involving the same entities, we
could include the booking date as part of the key, but then we would not be
able to distinguish between morning and evening performances on the same date
(unless we included time as well as date).
Entity-relationship diagram
The determinacy diagrams and relations, which are now fully normalised, can
also be viewed as entities linked by relationships using the data modelling tech-
nique described in the chapter on entity-relationship modelling. Each determi-
nacy diagram represents a relation, which in turn corresponds to an entity, as
can be seen in the entity-relationship diagram below. The relationships that
exist between each entity are summarised below the diagram.
This entity-relationship diagram represents the following:
• Each performer may have many bookings, so the relationship between
performer and booking is one-to-many.
• A performer earns a fee, the value of which depends on the performer type,
so the relationship between performer and fee is one-to-one (a performer
can only be of one type).
• An agent may make a number of bookings, so the relationship between
agent and booking is one-to-many.
• Any venue may have been booked several times, which makes the relation-
ship between venue and booking one-to-many.
• Each event may be involved in a number of bookings, so this relationship
is also one-to-many.
• The relationships that exist between performers, agents, venues and events
are shown by their connections through the bookings.
Exercise 3
Why have so many normal forms?
When moving to a higher normal form, we often have a choice about the way
in which we can decompose a relation into a number of other relations. This
section examines problems that can arise if the wrong decomposition is chosen.
As an example, suppose that within a government department responsible for
intelligence gathering, we wish to record details of employees in the department,
and the levels of their security clearance, which describe the levels of access
employees have to secret information. The relation might contain the following
attributes:
Relation EMPLOYEE (Employee, Security_code, Level)
Where Employee is the primary key and provides some convenient means of
identifying each employee, Security_code identifies the security clearance of that
employee, and Level identifies the level of access to secret information possessed
by anyone having that Security_code.
The determinants in this relation are:
Employee determines Security_code
Security_code determines Level
So we have two functional dependencies, respectively Security_code is func-
tionally dependent on Employee, and Level is functionally dependent on Se-
curity_code. We also have a transitive, or indirect dependency, of Level on
Employee; that is, an employee’s level of security clearance does depend on who
that employee is, but only via the value of their Security_code.
This relation is in second normal form; i.e. it contains no repeating groups and
no part-key dependencies, but it does contain a transitive dependency.
In order to convert relation EMPLOYEE to third normal form, we need to
decompose it to remove the transitive dependency of Level on Employee. Until
we make this decomposition, we have the following insert, update and deletion
anomalies:
• We cannot create a new Security_code until we have an Employee to
whom we wish to allocate it.
• If we change the Level of a Security_code, i.e. change the Level of infor-
mation employees who hold that code can access, then in relation EM-
PLOYEE we would have to propagate the update to all the employees
who hold that Security_code.
• If we remove the last Employee holding a particular Security_code, we
lose the information about the Level of clearance assigned to that Secu-
rity_code.
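The deletion anomaly can be demonstrated directly, using Python’s built-in sqlite3 module; the employee names and codes are invented for illustration.

```python
import sqlite3

# Sketch of the deletion anomaly in the un-decomposed EMPLOYEE relation.
# Sample employees and security codes are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (employee TEXT PRIMARY KEY,"
    " security_code TEXT, level TEXT)")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)", [
    ("Avery", "SC1", "top-secret"),
    ("Blake", "SC2", "restricted"),
])

# Blake is the only holder of SC2; removing the row loses the fact that
# SC2 means 'restricted'.
conn.execute("DELETE FROM employee WHERE employee = 'Blake'")
remaining = conn.execute("SELECT security_code FROM employee").fetchall()
print(remaining)  # SC2 and its associated Level have vanished entirely
```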
To perform the decomposition, suppose we split relation EMPLOYEE into two
relations as follows:
Decomposition A
Relation EMPLOYEE_CLEARANCE (Employee, Level)
Relation SECURITY_LEVEL (Security_code, Level)
There are problems with this decomposition. Suppose we wish to change the
security clearance for a given Employee. We can change the value of Level in
relation EMPLOYEE_CLEARANCE, but unfortunately, this update is not in-
dependent of the data held in relation SECURITY_LEVEL. In order for the
change to have taken place in the EMPLOYEE_CLEARANCE relation, one of
two things must have happened. Either the Employee in question has changed
his/her Security_code, in which case no update need be made to relation SE-
CURITY_LEVEL, or the Level associated with the Security_code held by
the Employee has changed, in which case relation SECURITY_LEVEL must
be changed to reflect this.
The problem has arisen because the two relations in decomposition A are not
independent of one another. There is in fact a functional dependency between
them: the fact that Security_code is functionally dependent on Employee. In de-
composition A, instead of storing in the same relation those data items that are
functionally dependent on one another, we have split the functional dependency
of Security_code on Employee across the two relations, preserving the transi-
tive dependency of Level on Employee in relation EMPLOYEE_CLEARANCE.
The problem this gives is that we cannot then make updates to one of these
relations without considering whether updates are required to the other. As a
further example, if we make updates to relation SECURITY_LEVEL, changing
the Level of access associated with each Security_code, we must make sure that
these updates are propagated to relation EMPLOYEE_CLEARANCE, i.e. that
the employees who possess the altered security codes have their Level attribute
updated to reflect the changes in the SECURITY_LEVEL relation.
The correct approach is to keep complete functional dependencies within single rela-
tions, rather than splitting them between the different relations. For the above
example, the correct decomposition would therefore be as follows:
Decomposition B
Relation EMPLOYEE_CODE (Employee, Security_code)
Relation ACCESS_LEVEL (Security_code, Level)
This decomposition allows us to manipulate the level of security granted to an
individual employee (in relation EMPLOYEE_CODE) independently of that
which specifies in general the level of access associated with security codes (main-
tained in relation ACCESS_LEVEL).
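A sketch of decomposition B in action, again with invented data, shows that the two relations can now be updated independently of one another.

```python
import sqlite3

# Sketch of decomposition B: each functional dependency kept whole in one
# relation. Sample data is invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee_code (employee TEXT PRIMARY KEY,"
    " security_code TEXT)")
conn.execute(
    "CREATE TABLE access_level (security_code TEXT PRIMARY KEY,"
    " level TEXT)")
conn.execute("INSERT INTO employee_code VALUES ('Avery', 'SC1')")
conn.execute("INSERT INTO access_level VALUES ('SC1', 'restricted')")

# Changing the level of access attached to SC1 is a single, self-contained
# update; no employee rows need touching.
conn.execute(
    "UPDATE access_level SET level = 'top-secret'"
    " WHERE security_code = 'SC1'")

# An employee's level is derived by a join, so it can never be inconsistent.
level = conn.execute("""
    SELECT a.level FROM employee_code e
    JOIN access_level a ON e.security_code = a.security_code
    WHERE e.employee = 'Avery'""").fetchone()[0]
```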
Denormalisation
Consider a relation COMPANY (Company, Department, Location), which holds
details of companies, their departments, and the locations of those departments.
There is a transitive dependency between Company and Location, via Department.
To convert relation COMPANY to third normal form, we would decompose it into:
Relation DEPARTMENT (Company, Department)
Relation LOCATION (Department, Location)
This would avoid the insert, update and deletion anomalies associated with
second normal form relations, giving us the ability to manipulate the information
about which departments make up a particular company quite independently
of the information about where particular departments are located. This is the
additional flexibility provided by taking the step of converting the application
to third normal form. However, if we wish to store information about a large
number of companies and departments, and we will not need to manipulate the
location information about departments independently of the information about
which departments make up a company, then we may choose not to proceed to
third normal form, but to leave relation COMPANY in second normal form.
Thus we’d retain the Company, Department and Location attributes in one
relation, where they can be queried and manipulated together, without the
need for JOINs.
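The trade-off can be sketched as follows, with invented sample data and sqlite3 used for illustration: the second normal form table answers a location query with a single table access, while the third normal form design needs a JOIN to produce the same answer.

```python
import sqlite3

# Sketch of the denormalisation trade-off. The 2NF COMPANY table answers
# the query in one table scan; the 3NF design needs a JOIN. Data invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE company (company TEXT, department TEXT PRIMARY KEY,"
    " location TEXT)")
conn.execute(
    "CREATE TABLE department (company TEXT, department TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE location (department TEXT PRIMARY KEY, location TEXT)")

conn.execute("INSERT INTO company VALUES ('Acme', 'Sales', 'London')")
conn.execute("INSERT INTO department VALUES ('Acme', 'Sales')")
conn.execute("INSERT INTO location VALUES ('Sales', 'London')")

# Denormalised (2NF): no JOIN needed.
flat = conn.execute(
    "SELECT location FROM company WHERE company = 'Acme'").fetchall()

# Normalised (3NF): the same answer, but via a JOIN of two relations.
joined = conn.execute("""
    SELECT l.location FROM department d
    JOIN location l ON d.department = l.department
    WHERE d.company = 'Acme'""").fetchall()
```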
If we choose to leave relation COMPANY in second normal form, what we will
have lost in terms of flexibility is as follows:
• The ability to create new departments at specific locations without allo-
cating them to a specific company.
Note that in this particular case, the update anomaly does not arise, as each
department is assumed to appear only once in relation COMPANY. Users of
the application may feel that the flexibility provided by third normal form is
simply not required in this application, in which case we can opt for the second
normal form design, with its improved performance.
The same arguments apply when considering whether to take any steps that
lead to a higher normal form; there is always a trade-off similar to the above,
between the increased flexibility of a more normalised design versus a faster-
running application in a less normalised design, due to the smaller number of
relations. When Relational database systems first arrived in the early ’80s, their
performance was generally slow, and this had an influence in slowing down their
adoption by some companies. Since that time, a huge amount of research and
development has gone into improving the performance of Relational systems.
This development work, plus the considerable improvements in the processing
speed of hardware, tends to suggest that the need to denormalise applications
should be reduced; however, it remains an important option for application
designers seeking to develop well-tuned applications.
Over-normalisation
There are a number of reasons why relations may be fragmented in this way, most
of which are directly concerned with improving performance. These objectives
are briefly examined below:
• Splitting a large table into a number of smaller tables often reduces the
number of rows or columns that need to be scanned by specific query or
update transactions for an application. This is particularly true when
the partitions created are a good match to different business functions
of the application - for example, in the customer table example above, if
customers in different regions of the world undergo different types of query
and update processing.
• The smaller tables retrieved by queries restricted to one, or even a few, of
a number of partitions will take up less space in main memory than the
large number of rows fetched by a query on a large table. This often means
that the small quantity of data is able to remain in memory, available for
further processing if required, rather than being swapped back to disk, as
would be likely to happen with a larger data set.
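A minimal sketch of this kind of fragmentation, assuming a customer table split horizontally by region; the region names and columns are our own assumptions.

```python
import sqlite3

# Sketch of horizontal fragmentation of a customer table by region.
# Region names, columns and data are invented for illustration.
conn = sqlite3.connect(":memory:")
for region in ("emea", "apac"):
    conn.execute(
        f"CREATE TABLE customers_{region} "
        "(customer_id TEXT PRIMARY KEY, name TEXT)")

conn.execute("INSERT INTO customers_emea VALUES ('C1', 'Dupont')")
conn.execute("INSERT INTO customers_apac VALUES ('C2', 'Tanaka')")

# A region-specific transaction scans only its own, smaller partition...
emea = conn.execute("SELECT name FROM customers_emea").fetchall()

# ...while a query spanning regions must recombine the partitions: the
# extra work a poorly chosen partitioning imposes on such transactions.
everyone = conn.execute(
    "SELECT name FROM customers_emea"
    " UNION ALL SELECT name FROM customers_apac").fetchall()
```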
Just as a good match to business functions for the chosen partitions means
the over-normalisation process will work well in improving performance, a poor
match to transaction requirements could lead to a poorer performance, because
the over-normalised design could lead to an increased number of JOINs.
Review questions
Review question 1
Consider the following scenario.
A database is being designed to store details of a hospital’s clinics and the
doctors who work in them. Each doctor is associated with just one hospital.
Each clinic has a two-to-three-hour session during which a doctor who is a
specialist in a particular field of medicine sees patients with problems in that
specialist area; for example, a diabetic clinic would be run by a doctor who is
a specialist in diabetes. The same clinic may occur in a number of different
hospitals; for example, several hospitals may run a diabetes clinic. Doctors may
hold a number of different clinics. A given clinic within the same hospital is
always held by the same doctor. The relation is therefore of the following form:
Relation CLINIC (Hospital, Clinic, Doctor)
A row in the relation signifies that a particular clinic is held by a particular
doctor in a specific hospital.
1. What are the determinants in the above relation?
2. Demonstrate that relation CLINIC is not in BCNF. Which normal form
is it in?
3. Explain any insertion, update or deletion problems that might arise with
relation CLINIC. How might these be resolved?
Review question 2
A library holds a database of the loans and services used by its members. Each
member may borrow up to 10 books at a time, and may reserve sessions making
use of library facilities, such as time on an Internet PC, a Multimedia PC,
booking a ticket for a performance at the library theatre, etc.
Describe how the above scenario could be handled in the process of normalisa-
tion.
Review question 3
It is required to develop a database which can provide information about soccer
teams: the number of games they have played, won, drawn and lost, and their
current position in the league.
Write down your thoughts on the issues involved in supporting the requirement
to provide the current league position, and how this is best satisfied.
Review question 4
If BCNF is a stronger definition of 3NF, and provides a more concise definition
of the normalisation process, why is it worth understanding the step-by-step
processes of moving from an un-normalised design to 3NF? Why is BCNF a
stronger normal form than 3NF?
Review question 5
A binary relation is a relation containing just two attributes. Is it true that any
binary relation must be in BCNF?
Review question 6
What is the difference between a repeating group and a multi-valued depen-
dency?
Review question 7
True or false: Splitting a functional dependency between two relations when
decomposing to a higher normal form is to be preferred to splitting a transitive
dependency. Give reasons to justify your assertion.
Review question 8
Explain the difference between denormalisation and over-normalisation. What
do the two techniques have in common, and what differences do they have?
Review question 9
A chemical plant produces chemical products in batches. Raw materials are
fed through various chemical processes called a production run, which turns
the raw materials into a final product. Each batch has a unique number, as
does each product produced by the plant. We can also assume that product
names are unique. Each production run results in the production of a quantity
of a particular product. We assume that only one product is produced in any
given production run, and so different production runs are required for different
products. The design of a relation to store the details of production runs could
look as follows:
Relation PRODUCTION_RUN (Product_no, Product_name, Batch_no,
Quantity)
In which normal form is relation PRODUCTION_RUN? Explain the reasoning
behind your assertion.
Resolve any anomalies that could arise in the manipulation of rows in the PRO-
DUCTION_RUN relation.
Review question 10
A company wishes to store details of its employees, their qualifications and
hobbies. Each employee has a number of qualifications, and independently of
these, a number of hobbies.
Produce a normalised design for storing this information.
Review question 11
We saw earlier the issues surrounding storing some types of derived data; for
example, the league position of soccer teams. Supposing we wish to store the
number of points accumulated by such teams, given the rules that:
• A win is awarded 3 points
• A draw is awarded 1 point
• There are no points for a defeat
Consider any problems that might be associated with the storage of the number
of points obtained by teams in such a league.
Discussion topic
Normalisation has been the second major technique we have examined for use
in the design of database applications, the other being entity-relationship mod-
elling. You are encouraged to discuss your feelings about the relative usefulness
of these two approaches with respect to the following:
• Learnability. Which of the two techniques have you found easier to learn?
Do not settle for merely identifying which technique was easier to learn,
but examine what it is about the techniques that makes one or other of
them easier to learn.
• Usability. Which of the techniques have you found easier to use so far in
the chapters you have worked through, and which would you expect to be
more useful in commercial application development? Be sure to back your
assertions with an explanation of why you believe them to be true.
Finally, consider what you believe the relative strengths and weaknesses of the
two design approaches to be, and consider to what extent these are or are not
complementary.
Chapter 10. Declarative Constraints and
Database Triggers
Table of contents
• Objectives
• Introduction
• Context
• Declarative constraints
– The PRIMARY KEY constraint
– The NOT NULL constraint
– The UNIQUE constraint
– The CHECK constraint
∗ Declaration of a basic CHECK constraint
∗ Complex CHECK constraints
– The FOREIGN KEY constraint
∗ CASCADE
∗ SET NULL
∗ SET DEFAULT
∗ NO ACTION
• Changing the definition of a table
– Add a new column
– Modify an existing column’s type
– Modify an existing column’s constraint definition
– Add a new constraint
– Drop an existing constraint
• Database triggers
– Types of triggers
∗ Event
∗ Level
∗ Timing
– Valid trigger types
• Creating triggers
– Statement-level trigger
∗ Option for the UPDATE event
– Row-level triggers
∗ Option for the row-level triggers
– Removing triggers
– Using triggers to maintain referential integrity
– Using triggers to maintain business rules
• Additional features of Oracle
– Stored procedures
– Functions and packages
– Creating procedures
– Creating functions
– Calling a procedure from within a function and vice versa
• Discussion topics
• Additional content and activities
Objectives
Introduction
In parallel with this chapter, you should read Chapter 8 of Thomas Connolly
and Carolyn Begg, “Database Systems A Practical Approach to Design, Imple-
mentation, and Management”, (5th edn.).
This chapter introduces you to some of the most advanced features of Relational
databases and SQL, namely declarative constraints, database triggers and stored
procedures. These features have been made available in popular DBMSs such
as Oracle. They provide those DBMSs with greater flexibility and power in
dealing with complexities in many demanding business applications.
The reason for studying these advanced database features is that we need to
address a growing trend of providing mechanisms for the processing as well as
storage of data in database systems. Declarative constraints are a means of
recording some types of business rules within a database system, and by doing
so, have them systematically applied across all the applications operating on the
database. Database triggers and stored procedures are additional mechanisms
provided in some of the most powerful DBMSs (e.g. Oracle) for storing and
applying logic at the database rather than application level.
The contents of this chapter are closely related to some of the others in this
module. The distribution of processing in an application is an area of design
that has developed with the evolution of client-server computing. A database de-
signer now has choices about whether to place some aspects of business logic at
the server (where the database resides), by having them built into the database
system and enforced at that level, or at the client where it is enforced at the
application level. This chapter extends the SQL constructs studied in the chap-
ter on Advanced SQL and discusses how business rules can be captured at the
database level. Because of the design decisions that need to be made about the
placing of business logic, this chapter also relates to the two on database design,
and the chapter on distributed databases and client-server applications.
Because different DBMSs may implement those advanced features in different
ways, our study will be focused on the related functionality provided by Or-
acle. Oracle PL/SQL statements will be used to provide examples to enable
detailed discussions. Other DBMSs should provide similar functionalities, but
you should consult the system’s documentation if you come across
any incompatibilities. All SQL statements are in capital letters.
Context
For any complex database application, it is likely that there will be two or more
tables involved to store information. It is also likely that data within the same
table or in different tables will have to maintain some kind of relationship to
reflect the corresponding business logic. In addition, some attributes (columns)
of a table may need to have certain conditions imposed on them, and these
conditions, which are often used to capture necessary business rules, need to be
satisfied at all times.
In order to accommodate these practical needs of database applications, the
SQL standard provides mechanisms to maintain the integrity of databases; that
is, the integrity of data within a single table or in different tables as a whole.
Declarative constraints are one such mechanism. They are used to define,
according to the business application rules, conditions on columns and tables.
Once defined (i.e. declared), these conditions will be enforced by the DBMS
automatically.
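A minimal sketch of this automatic enforcement, using Python’s built-in sqlite3 module for illustration; Oracle’s declaration syntax for such constraints is broadly similar.

```python
import sqlite3

# Sketch: once declared, constraints are enforced by the DBMS on every
# statement, with no application code involved. Illustrated with sqlite3;
# the table and the CHECK rule are our own assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE student (
        sid    TEXT PRIMARY KEY,
        sname  TEXT NOT NULL,
        slevel INTEGER CHECK (slevel BETWEEN 1 AND 4)
    )""")

conn.execute("INSERT INTO student VALUES ('S1', 'Ann', 2)")  # satisfies all rules

try:
    conn.execute("INSERT INTO student VALUES ('S2', 'Bob', 9)")  # violates CHECK
except sqlite3.IntegrityError as exc:
    violation = str(exc)  # the DBMS rejects the offending row automatically
```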
As can be seen from the above description, the types of declarative constraints
that can be declared are predefined by the DBMS, which conforms to the SQL
standard. Usually they are used to store and enforce the kinds of business rules
which are generally needed across different applications. Although they are sim-
ple to use and maintain, they lack some necessary flexibility and may not always
be able to satisfy some specific needs of individual applications. To compensate
for this, some DBMSs (e.g. Oracle) provide another type of mechanism to ensure
database integrity: database triggers and stored procedures.
A procedure is a set of SQL or PL/SQL (in the case of Oracle) statements used
together to execute a particular function. Database triggers are a mechanism
that allows a database designer to write procedures that are automatically exe-
cuted whenever a predefined situation (an event) is raised by the execution of
INSERT, UPDATE or DELETE statements on a table or view. Because the
database designer is responsible for creating triggers and writing procedures,
he/she has overall control. This control can be used to capture and build
business logic into the database as necessary. As a result, this mechanism of-
fers greater flexibility and fewer restrictions for the designer to develop complex
database applications. In short, database triggers and procedures are not only
able to enforce integrity constraints, but can also be used to write customised
functions to satisfy individual applications’ needs.
In the rest of this chapter, we are going to study in detail declarative constraints,
database triggers and procedures. We will see how they are used in practical
applications, what the advantages and drawbacks are, and what the solutions
are to potential problems.
To facilitate detailed discussions, suppose we need to implement a database for a
university. The basic requirements state that there are four entities: STUDENT,
MODULE, LECTURER and DEPT. A student can attend as many modules as
necessary, and a module must be attended by at least one student. A module
must be taught by one and only one lecturer, but a lecturer may teach between
one and four modules. A student should be enrolled to a department; a module
should be offered by one and only one department; a lecturer should belong to
one and only one department.
It is not difficult to see that we will need to implement five tables: four tables
for the four entities and one table (called RECORD) for the many-to-many
relationship between STUDENT and MODULE.
• STUDENT (SID, SNAME, DNAME, SLEVEL, SEMAIL).
• LECTURER (EID, LNAME, LEMAIL, DNAME).
• MODULE (CODE, TITLE, EID, DNAME).
• DEPT (DNAME, LOCATION).
• RECORD (SID, CODE, MARK).
In the STUDENT table, SID is the student’s identity number and the primary
key, SNAME is the student’s name, DNAME is the department to which the
student has enrolled, SLEVEL is the level the student is at, and SEMAIL is
the student’s email address. In the LECTURER table, EID is the employee
identity number for the lecturer and the primary key, LNAME is the lecturer’s
name, LEMAIL is the lecturer’s email address and DNAME is the name of the
department. In the MODULE table, CODE is the code of the module and
the primary key, TITLE is the title of the module, EID is the identity number
of the lecturer teaching the module and DNAME is the name of the department the
module belongs to. The DEPT table has only two attributes, department name
DNAME (primary key) and location of the department in the university. In the
RECORD table, SID is the student number, CODE is the code of the module
and MARK is the mark a student obtained from attending a module. SID and
CODE together make up the primary key.
Declarative constraints
Constraints are a mechanism provided within the DDL SQL standard to main-
tain the consistency and integrity of a database and, at the same time, enforce
certain business rules in the database application. There are five different types
of declarative constraints in SQL that can be defined on a database column
within a table, and they are as follows:
• PRIMARY KEY
• NOT NULL
• UNIQUE
• CHECK
• FOREIGN KEY
The PRIMARY KEY constraint is used to maintain the so-called entity integrity.
When such a constraint is declared on a column of a table, the DBMS enforces
the following rules:
• The column value must be unique within the table.
• The value must exist for any tuple (a record or a row of data) that is to
be stored in the table. That is, the column cannot have a NULL value.
For the STUDENT table in our university database, for example, we have SID
as the key attribute. As a normal business rule, all students must have a valid
and unique ID number as soon as they are enrolled. Thus, the SID column must
have a unique value and cannot be null. To enforce this business rule, we can
have the PRIMARY KEY constraint declared on the column when creating the
STUDENT table. One way to do this is:
CREATE TABLE STUDENT (
SID NUMBER(5) CONSTRAINT PK_STUDENT PRIMARY KEY,
SNAME VARCHAR2(30),
DNAME VARCHAR2(30),
SLEVEL NUMBER(1),
SEMAIL VARCHAR2(40));
In the above SQL statement, the constraint is declared by using the keywords
CONSTRAINT and PRIMARY KEY. A column definition clause with such a
constraint declaration is called a column constraint clause. “PK_STUDENT”
is a user-defined name for the constraint. It is optional, but when defined, it
can help the database designer and user to pinpoint a violation of this con-
straint. The reason is that when this particular constraint is violated, the
DBMS will generate an error/warning message which includes the constraint’s
name. A usual convention for defining a PRIMARY KEY constraint’s name is
“PK_Table_Name”.
There is an alternative way to declare the PRIMARY KEY constraint:
CREATE TABLE STUDENT (
SID NUMBER(5),
SNAME VARCHAR2(30),
DNAME VARCHAR2(30),
SLEVEL NUMBER(1),
SEMAIL VARCHAR2(40),
CONSTRAINT PK_STUDENT PRIMARY KEY (SID));
OR
CREATE TABLE STUDENT (
SID NUMBER(5),
SNAME VARCHAR2(30),
DNAME VARCHAR2(30),
SLEVEL NUMBER(1),
SEMAIL VARCHAR2(40),
PRIMARY KEY (SID));
In this SQL statement, a separate clause (called table constraint clause) is used
to define the constraint. The column name (e.g. SID) must be explicitly stated
in the list (i.e. within the brackets ( )). If the table has a composite key, then
the list will include all the key attributes. For example, to create the RECORD
table, we have:
CREATE TABLE RECORD (
SID NUMBER(5),
CODE VARCHAR2(6),
MARK NUMBER(3),
CONSTRAINT PK_RECORD PRIMARY KEY (SID, CODE));
By enforcing the PRIMARY KEY constraint, the DBMS can prevent any at-
tempt or mistake of inserting or updating a student record with a duplicate
student number. It also ensures that every student on record has a valid ID
number. In the RECORD table, it ensures that each record has a unique com-
bination of SID and CODE values, which means that a student will never be
allowed to have two or more records for the same module.
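To illustrate, assuming the STUDENT table above already holds a student with SID 10001 (the data values here are hypothetical and for illustration only), a second insert reusing that number would be rejected:

INSERT INTO STUDENT (SID, SNAME, DNAME, SLEVEL, SEMAIL)
VALUES (10001, 'Bob Jones', 'Computing Science', 2, 'bob@uni.ac.uk');
-- Rejected: this violates PK_STUDENT. In Oracle, an ORA-00001
-- "unique constraint violated" error is raised, quoting the
-- constraint name PK_STUDENT.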
It must be emphasised that a table can have at most one PRIMARY KEY
constraint, and it is actually optional (a table does not have to have a PRIMARY
KEY constraint). However, it is rare that a table be created without such a
constraint, because tables usually do have a primary key.
Review question 1
1. What types of constraints can be declared in SQL?
2. What rules are enforced by the PRIMARY KEY constraint?
3. Is it true that a table must have at least one PRIMARY KEY constraint?
The NOT NULL constraint is imposed on any column that must have a value.
In the STUDENT table, for example, the attributes DNAME and SLEVEL can
have this constraint declared on them to reflect the application requirement that
whenever a student is enrolled, he/she must be assigned to a department and
be at a certain level.
To declare the constraint on DNAME and SLEVEL, we can use the following
SQL statement to create table STUDENT:
CREATE TABLE STUDENT (
SID NUMBER(5),
SNAME VARCHAR2(30),
DNAME VARCHAR2(30) CONSTRAINT NN_STUDENT_DNAME NOT
NULL,
SLEVEL NUMBER(1) NOT NULL,
SEMAIL VARCHAR2(40),
CONSTRAINT PK_STUDENT PRIMARY KEY (SID));
You may have noticed that the constraint on DNAME has been given a user-
defined name “NN_STUDENT_DNAME”, while the one on SLEVEL has not.
It is optional to name a NOT NULL constraint. Unlike the PRIMARY KEY
constraint, it does not make much difference whether or not you choose to
define a name for the constraint. In Oracle, when the NOT NULL constraint is
violated, the system will generate an error message. However, this message will
not include the name of the NOT NULL constraint, even if one is defined.
Also notice that when a constraint is not to be given a user-defined name, the
keyword CONSTRAINT is not used. The same applies to other constraint
definitions.
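As an illustration, with the table above in place, an insert that omits DNAME would fail (the data values are hypothetical):

INSERT INTO STUDENT (SID, SNAME, SLEVEL)
VALUES (10002, 'Carol White', 1);
-- Rejected: DNAME receives NULL here, violating NN_STUDENT_DNAME.
-- In Oracle the resulting error typically names the column
-- ("cannot insert NULL into ..."), not the constraint.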
The UNIQUE constraint is similar to the PRIMARY KEY constraint, except that
NULL values are allowed (and a table may have more than one UNIQUE con-
straint). In the STUDENT table, for example, the SEMAIL
attribute should have this constraint. The reason is that according to the uni-
versity’s policy, a student may or may not be given an email account. However,
when one is given, the email account name must be unique. By enforcing this
constraint on SEMAIL, the DBMS can ensure that different students will not
be allowed to have the same email addresses. For those who do not have an
email account, the SEMAIL column can have NULL values.
To declare the UNIQUE constraint on SEMAIL, we can use the following SQL
statement to create table STUDENT:
CREATE TABLE STUDENT (
SID NUMBER(5),
SNAME VARCHAR2(30),
DNAME VARCHAR2(30) NOT NULL,
SLEVEL NUMBER(1) NOT NULL,
SEMAIL VARCHAR2(40) CONSTRAINT UK_STUDENT_SEMAIL
UNIQUE,
CONSTRAINT PK_STUDENT PRIMARY KEY (SID));
Again, an optional user-defined name “UK_STUDENT_SEMAIL” is given to
the constraint. This is a good practice in Oracle, because when the UNIQUE
constraint is violated, the system will generate an error message containing the
name. Similar to the PRIMARY KEY constraint, the constraint’s name helps
pinpoint the violation. You can avoid giving the constraint a name and just use
the UNIQUE keyword:
SEMAIL VARCHAR2(40) UNIQUE
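A brief sketch of this behaviour, using hypothetical data: the UNIQUE constraint permits repeated NULLs but rejects duplicate non-NULL values.

INSERT INTO STUDENT (SID, SNAME, DNAME, SLEVEL, SEMAIL)
VALUES (10003, 'Dan Green', 'Computing Science', 1, NULL);   -- accepted
INSERT INTO STUDENT (SID, SNAME, DNAME, SLEVEL, SEMAIL)
VALUES (10004, 'Eve Black', 'Computing Science', 1, NULL);   -- accepted: NULLs do not clash
INSERT INTO STUDENT (SID, SNAME, DNAME, SLEVEL, SEMAIL)
VALUES (10005, 'Fay Brown', 'Computing Science', 1, 'dan@uni.ac.uk');
-- A later attempt to give another student the same SEMAIL value
-- 'dan@uni.ac.uk' would be rejected for violating UK_STUDENT_SEMAIL.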
Review question 2
Why is it a good practice to give a name to a declarative constraint?
The CHECK constraint defines a discrete list of values that a column can have.
This list of values may be literally expressed within the constraint declaration
or may be defined using a mathematical expression. In the STUDENT table,
for example, a student must be at a level between 0 and 3. To impose such a
constraint, the CREATE statement for the STUDENT table will be as follows:
CREATE TABLE STUDENT (
SID NUMBER(5),
SNAME VARCHAR2(30),
DNAME VARCHAR2(30) NOT NULL,
SLEVEL NUMBER(1) NOT NULL CONSTRAINT CK_STUDENT_LEVEL
CHECK ((SLEVEL>=0) AND (SLEVEL<=3)),
SEMAIL VARCHAR2(40) CONSTRAINT UK_STUDENT_SEMAIL
UNIQUE,
CONSTRAINT PK_STUDENT PRIMARY KEY (SID));
Notice two things in the above CREATE statement. First, the CHECK con-
straint can be declared in a column constraint clause and concatenated (linked)
with other NOT NULL, UNIQUE and/or PRIMARY KEY constraints. When
constraints are concatenated in this way, no separator is placed between them;
the comma appears only after the last constraint, ending the column definition.
Second, the check
condition (e.g. (SLEVEL>=0) AND (SLEVEL<=3)) can include logical con-
nectors such as AND and OR. Thus, it is possible to define a complex condition.
Alternatively, the CHECK constraint can be defined using a table constraint
clause, such as:
CREATE TABLE STUDENT (
SID NUMBER(5),
SNAME VARCHAR2(30),
DNAME VARCHAR2(30) NOT NULL,
SLEVEL NUMBER(1) NOT NULL,
SEMAIL VARCHAR2(40) CONSTRAINT UK_STUDENT_SEMAIL
UNIQUE,
CONSTRAINT PK_STUDENT PRIMARY KEY (SID),
CONSTRAINT CK_STUDENT_LEVEL CHECK ((SLEVEL>=0) AND
(SLEVEL<=3)));
It is worth mentioning that when the CHECK constraint is applied to a list of
literal values, the values are case sensitive. For example, if only students in the
Department of Computing Science or Information Technology are allowed to be
in the database, a CHECK constraint is defined on DNAME in the following
way:
……
DNAME VARCHAR2(30) NOT NULL, ……,
CHECK (DNAME IN ('Computing Science', 'Information Technology')), ……;
Any value that does not exactly match the specified values (including ‘Comput-
ing science’) will cause a violation.
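For instance, with this CHECK constraint in place, the following insert (hypothetical data values) would be rejected purely because of the lower-case ‘s’:

INSERT INTO STUDENT (SID, SNAME, DNAME, SLEVEL)
VALUES (10006, 'Guy Field', 'Computing science', 1);
-- Rejected: 'Computing science' does not exactly match any listed value.
-- Changing DNAME to 'Computing Science' would make the row acceptable.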
We saw in earlier chapters, when introducing the Relational model, that entities
are often linked by a one-to-many relationship. For example, a department
may contain many employees, so we say there is a one-to-many relationship
between instances of the department entity and instances of the employee entity.
Entities related in this way are sometimes referred to as parents and children;
in the example above, the department entity would be the parent, and the
employee entity would be the child.
A foreign key is a column or a set of columns (attributes) that links each row
in the child table containing the foreign key to the row of the parent table
containing the matching key value. The FOREIGN KEY constraint enforces
referential integrity, which means that, if the foreign key contains a value, that
value must refer to an existing, valid row in the parent table.
In our university database, for example, SID and CODE are foreign keys in the
RECORD table (notice that SID and CODE together form the primary key
for RECORD as well), and RECORD has two parent tables STUDENT and
MODULE. For a row in RECORD, there must be an existing student row with
the same SID value in STUDENT, and a valid row in MODULE with the same
CODE value. Otherwise, the referential integrity is broken. One important
implication of this is that when using FOREIGN KEY constraints, the parent
tables must be created before the child tables, and the parent tables must be
populated before the child tables, in order to avoid constraint violations. It is
important to bear in mind this required order of doing things when undertaking
practical work involving FOREIGN KEY constraints.
The following SQL statement can be used to declare the FOREIGN KEY con-
straints on SID and CODE when creating the RECORD table.
CREATE TABLE RECORD (
SID NUMBER(5),
CODE VARCHAR2(6),
MARK NUMBER(3),
CONSTRAINT PK_RECORD PRIMARY KEY (SID, CODE),
CONSTRAINT FK_RECORD_SID FOREIGN KEY (SID) REFERENCES
STUDENT,
FOREIGN KEY (CODE) REFERENCES MODULE);
It can be seen from the above example that:
• The FOREIGN KEY constraint can be given an optional name. In the
example, FK_RECORD_SID is the name for the constraint on SID. To
define the name, the keyword CONSTRAINT must be used. Otherwise,
it is omitted as in the case of declaring the constraint on CODE.
• The keywords FOREIGN KEY define which column (or columns) is the
foreign key column to be constrained.
• The keyword REFERENCES indicates the parent table.
By declaring and enforcing the FOREIGN KEY constraint, the DBMS can
ensure that the referential integrity is maintained in both the child table(s) and
the parent table(s).
In the child table, the DBMS will not allow any INSERT or UPDATE operation
that attempts to create a foreign key value without a matching candidate key
value in the corresponding parent table (as indicated by the REFERENCES
clause).
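For example (the data values are hypothetical), the following insert would be rejected unless a matching parent row already exists in each parent table:

INSERT INTO RECORD (SID, CODE, MARK)
VALUES (99999, 'CS1234', 65);
-- Rejected unless STUDENT contains a row with SID 99999 and
-- MODULE contains a row with CODE 'CS1234'.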
In the parent table, the DBMS ensures that appropriate actions are taken for any
UPDATE or DELETE operation that attempts to change or delete a candidate
key value that is being referenced by some rows in the child table. The kind of
actions that can be taken are user definable. They are CASCADE, SET NULL,
SET DEFAULT and NO ACTION.
CASCADE
This action can be triggered by either a DELETE or an UPDATE operation.
When a parent row is deleted, all its child rows are also deleted. This action
can subsequently be applied to each child row deleted, because such rows may
themselves have a candidate key that is used as a foreign key in some other
tables. Thus, this action may be executed in a cascading manner.
The CASCADE option can be specified in SQL as follows:
CREATE TABLE RECORD (
SID NUMBER(5),
CODE VARCHAR2(6),
MARK NUMBER(3),
CONSTRAINT PK_RECORD PRIMARY KEY (SID, CODE),
FOREIGN KEY (SID) REFERENCES STUDENT ON DELETE CASCADE,
FOREIGN KEY (CODE) REFERENCES MODULE);
In this example, when a student row is deleted from the STUDENT table, all
his/her records will also be removed from the RECORD table.
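As a usage sketch (SID 10001 is a hypothetical value), a single delete on the parent now cleans up the child table as well:

DELETE FROM STUDENT WHERE SID = 10001;
-- With ON DELETE CASCADE declared, all rows in RECORD
-- with SID = 10001 are deleted automatically.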
When the candidate key value is changed (by an UPDATE operation), the foreign
key column in the child table is set to the same new value. Similar to CASCADE
by DELETE, such update actions can be carried out in a cascading manner to
the child tables of the child table and so on. For example, when a student’s
identity number (SID) is changed, all his/her records in the RECORD table
should have the new SID value to replace the old. In Oracle, such an action can
be defined by creating a trigger (to be discussed later).
SET NULL
When a row is deleted from the parent table, all its child rows will have their
corresponding foreign key column set to NULL. This option is only valid if the
foreign key column allows NULL values (i.e. it has neither the PRIMARY KEY
constraint nor the NOT NULL constraint).
Similarly, when the candidate key value of the parent row is changed, all its child
rows may have their corresponding foreign key column set to NULL. Again, this
option is valid if and only if the foreign key column allows NULL values.
For deletes, Oracle can also enforce this option declaratively, using the ON
DELETE SET NULL clause of the FOREIGN KEY constraint; for updates, a
corresponding trigger must be created.
SET DEFAULT
By having this option, the operation of deleting the parent row or updating the
candidate key value in the parent table will set the corresponding foreign key
column in the child table to its default value. This option is only valid if the
foreign key column has a DEFAULT value specified.
Again in Oracle, this option can be implemented using appropriate triggers.
NO ACTION
This is the default option. If there is no other option specified, the DBMS
will reject any DELETE or UPDATE in the parent table that may affect rows
in the child tables. Any such illegal attempt (to break the referential integrity)
will raise an error message in Oracle.
Review question 4
1. Does the keyword CONSTRAINT always need to be used in declaring a
constraint?
2. What are the rules enforced by the FOREIGN KEY constraint?
Activity 1 - Creating tables with appropriate constraints
For the university database described in the Context section, we now want to
use SQL to create five tables as specified below:
STUDENT
• SID: a five-digit number, which is also the primary key of the table.
• SNAME: a string of characters; maximum length is 30.
• SLEVEL: a single-digit integer; must have a value.
• SEMAIL: a string of characters; maximum length is 40; must be unique.
• DNAME: foreign key referring to the DEPT table; must have a value.
MODULE
• CODE: a string of 6 letters and/or numbers; primary key of the table.
• TITLE: a string of characters; maximum length is 45; must be unique.
• EID: foreign key referring to the LECTURER table; must have a value.
• DNAME: foreign key referring to the DEPT table; must have a value.
LECTURER
• EID: a six-digit number; primary key of the table.
• LNAME: a string of characters; maximum length is 30.
• LEMAIL: a string of characters; maximum length is 40; must be unique.
• DNAME: foreign key referring to the DEPT table; must have a value.
DEPT
• DNAME: a string of characters; maximum length is 30; primary key of
the table.
• LOCATION: a string of characters; maximum length is 35; must have a
value.
RECORD
• SID: foreign key referring to the STUDENT table; primary key attribute.
• CODE: foreign key referring to the MODULE table; primary key attribute.
• MARK: an integer.
Activity 2 - Inserting data into the tables and enforcing constraints
Having created the five tables, we can now insert data records into them. The
records that are to be stored are listed below. Insert each of them in turn and
see what happens, bearing in mind the constraints that some of the columns
may have.
Activity 3 - Maintaining referential integrity
We have learned that by declaring and enforcing the FOREIGN KEY constraint,
the DBMS can ensure that the referential integrity is maintained in both the
child table(s) and the parent table(s).
In the child table, the DBMS will not allow any INSERT or UPDATE operation
that attempts to create a foreign key value without a matching candidate key
value in the corresponding parent table (as indicated by the REFERENCES
clause). We have seen one of such examples in Activity 2.
In the parent table, the DBMS ensures that appropriate actions are taken for any
UPDATE or DELETE operation that attempts to change or delete a candidate
key value that is being referenced by some rows in the child table.
Now try to perform the following two operations on our university database and
see what happens:
• Operation 1: Change the name of the Department of Computing Science
to simply ‘Computing’ in the DEPT table.
• Operation 2: Delete all level 3 students from the STUDENT table.
Changing the definition of a table
Once created, a table’s definition can still be changed using the ALTER TABLE
command in SQL. Different DBMSs implement ALTER TABLE differently, pro-
viding more or less functionality than that specified in the SQL standard. In
Oracle, the following operations can be carried out on a table using appropriate
ALTER TABLE statements:
• Add a new column, including a constraint declaration for that column.
• Modify an existing column’s type, with certain restrictions.
• Modify an existing column’s constraint definition, with certain restric-
tions.
• Add a new constraint to an existing column.
• Drop (remove) an existing constraint from a column.
Using the ALTER TABLE command, we can modify the type definition of a
column with the following restrictions:
• If there is data (except NULL) present in the column, then the type of
this column cannot be changed. The type definition can only be changed
if the table is empty, or all values in the column concerned are NULL.
• If the type definition is changed on a column with a UNIQUE or PRI-
MARY KEY constraint, it may potentially become incompatible with the
data type of a referencing FOREIGN KEY. Thus, the ALTER TABLE
command should be used with caution.
The following SQL statement changes the data type of SID in the RECORD
table to NUMBER(9) (the original type was NUMBER(5)):
ALTER TABLE RECORD
MODIFY SID NUMBER(9);
Notice that SID in RECORD is a foreign key referencing SID in STUDENT
which still has the type NUMBER(5). However, because NUMBER(9) and
NUMBER(5) are compatible, this ALTER TABLE operation is allowed. If we
attempt to change SID to be of type VARCHAR2(5), it will be rejected because
of incompatible data types.
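The rejected attempt just described would look like this:

ALTER TABLE RECORD
MODIFY SID VARCHAR2(5);
-- Rejected: VARCHAR2 is incompatible with the NUMBER type of the
-- referenced STUDENT.SID (and with any numeric data already stored
-- in the column).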
New constraints, such as UNIQUE, CHECK and FOREIGN KEY, can be added
to a column.
In our university database, for example, we may add a UNIQUE constraint on
SNAME in the STUDENT table. As a result, no students can have the same
name in the database.
ALTER TABLE STUDENT
ADD CONSTRAINT UK_STUDENT_SNAME UNIQUE(SNAME);
In the RECORD table, if we want to ensure that MARK is always less than or
equal to 100, we can add a CHECK constraint on MARK as follows:
ALTER TABLE RECORD ADD CONSTRAINT CK_RECORD_MARK
CHECK (MARK <= 100);
In the LECTURER table, DNAME is a foreign key with a link to the DEPT
table. If we did not declare a FOREIGN KEY constraint on DNAME when
creating LECTURER, we can add it now using the following statement:
ALTER TABLE LECTURER
ADD CONSTRAINT FK_LECTURER_DNAME
FOREIGN KEY (DNAME) REFERENCES DEPT;
Notice that we have given names to all the newly added constraints. They will
be helpful when constraints have to be dropped.
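For example, the UNIQUE constraint just added to SNAME could later be removed by referring to its name:

ALTER TABLE STUDENT
DROP CONSTRAINT UK_STUDENT_SNAME;
-- The constraint is removed; duplicate SNAME values are allowed again.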
Review question 5
1. How does one change the definition of a constraint on a column?
2. How does one remove an existing constraint?
Activity 4 - Changing an existing column’s constraint
In Activity 2, when we were trying to insert a student record that had no value
for SLEVEL into the STUDENT table, an error occurred. This was because
there was a NOT NULL constraint declared on SLEVEL. As a result, the
SLEVEL column cannot take NULL values. The insertion did not take place.
In this activity, we will change the constraint on SLEVEL from NOT NULL to
NULL so that NULL value is allowed. Write an SQL statement to perform the
change and then re-insert the record into the STUDENT table. It should now
be held in the table.
Activity 5 - Adding a new constraint to a column
When the RECORD table was created, there was no constraint on MARK. As a
normal business rule, however, we know that a student’s mark should always be
between 0 and 100. Thus, we can declare an appropriate constraint to enforce
this rule. Since the RECORD table has already been created, we need to use
the ALTER TABLE command to add the new constraint.
Write a proper SQL statement to perform the required operation. Having added
the new constraint, try to increase all the marks in module CS1234 by 80 and
see what happens.
Activity 6 - Modifying an existing FOREIGN KEY constraint
In Activity 3, we could not delete level 3 students from the STUDENT table,
because they were still being referenced by some child rows via the foreign key
SID in the RECORD table.
Now suppose that we want to relax the constraint a bit so that we may remove
student rows from the STUDENT table together with their child rows. In this
case, the FOREIGN KEY constraint on SID in RECORD needs to be changed to
include the option to allow cascade deletions.
Write proper SQL statements to perform the required modification. Remember
that for existing constraints such as FOREIGN KEY, UNIQUE and CHECK, if
they need to be modified, they have to be removed first and then new ones added.
Now, having modified the constraint, perform the operation to delete all level
3 students from the STUDENT table. Then check both the STUDENT and
RECORD tables to see what rows have been deleted.
Database triggers
A trigger defines an action the database should take when some database-related
event occurs. Triggers may be used to:
• Supplement declarative constraints, to maintain database integrity.
• Enforce complex business rules.
• Audit changes to data.
Different DBMSs may implement the trigger mechanism differently. In this
chapter, we use Oracle to discuss triggers. In Oracle, a trigger consists of a set of
PL/SQL statements. The execution of triggers is transparent to the user. They
are executed by the DBMS when specific types of data manipulation commands
are performed on specific tables. Such commands include INSERT, UPDATE
and DELETE.
Because of their flexibility, triggers may supplement database integrity con-
straints. However, they should not be used to replace them. When enforcing
business rules in an application, you should first rely on the declarative con-
straints available in the DBMS (e.g. Oracle); only use triggers to enforce rules
that cannot be coded through declarative constraints. This is because the en-
forcement of the declarative constraints is more efficient than the execution of
user-created triggers.
It is worth mentioning that in order to create a trigger on a table, you must be
able to alter that table and any other table that may subsequently be affected
by the trigger’s action. You need to ensure that you have sufficient privilege to
do so.
Types of triggers
In Oracle, there are fourteen types of triggers that can be implemented using
PL/SQL. Once again, note that other DBMSs may not have the same support,
and that you should consult your system’s documentation if you encounter any
problems. The type of a trigger is defined by the following three features:
• Event
• Level
• Timing
Event
Refers to the triggering SQL statement; that is, INSERT, UPDATE or DELETE.
A single trigger can be designed to fire on any combination of these SQL state-
ments.
Level
Refers to statement-level versus row-level triggers. The level of a trigger denotes
whether the trigger fires once per SQL statement or once for each row affected
by the SQL statement.
Statement-level triggers execute once for each SQL statement. For example, if
an UPDATE statement updates 300 rows in a table, the statement-level trigger
of that table would only be executed once. Thus, these triggers are not often
used for data-related activities. Instead, they are normally used to enforce
additional security measures on the types of transactions that may be performed
on a table.
Statement-level triggers are the default type of triggers created via the CREATE
TRIGGER command.
Row-level triggers execute once for each row operated upon by a SQL statement.
For example, if an UPDATE statement updates 300 rows in a table, the row-level
trigger of that table would be executed 300 times. Also, row-level triggers have
access to column values of the row currently being operated upon by the SQL
statement. They can evaluate the contents of each column for that row. Thus,
they are the most common type of triggers and are often used in data-auditing
applications.
Row-level triggers are created using the FOR EACH ROW clause in the CRE-
ATE TRIGGER command.
It is important to know that a trigger can only be associated with one table,
but a table can have a mixture of different types of triggers.
Timing
Timing denotes whether the trigger fires BEFORE or AFTER the statement-
level or row-level execution. In other words, triggers can be set to occur im-
mediately before or after those triggering events (i.e. INSERT, UPDATE and
DELETE).
Within the trigger, one will be able to reference the old and new values involved
in the transaction. ‘Old’ refers to the data as it existed prior to the transaction.
UPDATE and DELETE operations usually reference such old values. ‘New’ val-
ues are the data values that the transaction creates (such as being INSERTed).
If one needs to set a column value in an inserted row via a trigger, then a BE-
FORE INSERT trigger is required in order to access the ‘new’ values. Using an
AFTER INSERT trigger would not allow one to set the inserted value, since the
row will already have been inserted into the table. For example, the BEFORE
INSERT trigger can be used to check if the column values to be inserted are
valid or not. If there is an invalid value (according to some pre-specified business
rules), the trigger can take action to modify it. Then only validated values will
be inserted into the table.
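As a sketch of this idea (the trigger name and the capping rule are illustrative assumptions, not part of the university requirements), a BEFORE INSERT row-level trigger on RECORD could adjust an out-of-range mark before the row is stored:

CREATE TRIGGER validate_mark_on_record
BEFORE INSERT ON RECORD
FOR EACH ROW
BEGIN
  -- Cap any mark above 100 at 100 before the row is stored
  IF :NEW.MARK > 100 THEN
    :NEW.MARK := 100;
  END IF;
END;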
AFTER row-level triggers are often used in auditing applications, since they do
not fire until the row has been modified. Because the row has been successfully
modified, it implies that it has satisfied the referential integrity constraints
defined for that table.
In Oracle, there is a special BEFORE type of trigger called an INSTEAD OF
trigger. Using an INSTEAD OF trigger, one can instruct Oracle what to do
instead of executing the SQL statement that has activated the trigger. The code
in the INSTEAD OF trigger is executed in place of the INSERT, UPDATE or
DELETE triggering transaction.
The first twelve types of triggers are the most commonly used, and are discussed
in this chapter. The INSTEAD OF triggers are more complex than the others, and
interested students are advised to refer to books specifically dealing with Oracle
PL/SQL programming.
Review question 6
What is a database trigger? Should triggers be used to replace declarative
constraints and why?
Creating triggers
Statement-level trigger
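The general form of such a statement-level trigger can be sketched as follows; the body is a placeholder, and the BEFORE timing shown here is just one possibility:

CREATE OR REPLACE TRIGGER first_trigger_on_student
BEFORE INSERT OR UPDATE OR DELETE ON STUDENT
BEGIN
[the trigger body consisting of PL/SQL code;]
END;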
In this case, any of the INSERT, UPDATE and DELETE operations will
activate the trigger. Also notice that instead of using a CREATE TRIG-
GER clause, we use CREATE OR REPLACE TRIGGER. Because the
“first_trigger_on_student” trigger is already in existence, the keywords
CREATE OR REPLACE are used. For defining new triggers, the keyword
CREATE alone is sufficient.
Row-level triggers
To define a row-level trigger, the FOR EACH ROW clause must be included
in the CREATE TRIGGER statement. For example, if we want to have a
trigger for each INSERT, UPDATE and DELETE operation on every row that
is affected, we can create the trigger in the following way:
CREATE TRIGGER third_trigger_on_student
AFTER INSERT OR UPDATE OR DELETE ON STUDENT
FOR EACH ROW
BEGIN
[the trigger body consisting of PL/SQL code;]
END;
The trigger will fire whenever a row has been inserted, updated or deleted. At
the row-level, we can also add the option for the UPDATE event.
CREATE TRIGGER fourth_trigger_on_student
AFTER INSERT OR UPDATE OF DNAME OR DELETE ON STUDENT
FOR EACH ROW
BEGIN
[the trigger body consisting of PL/SQL code;]
END;
CREATE TRIGGER fifth_trigger_on_student
AFTER INSERT OR UPDATE OF DNAME OR DELETE ON STUDENT
FOR EACH ROW WHEN (OLD.DNAME = 'English')
BEGIN
[the trigger body consisting of PL/SQL code;]
END;
Compare “fifth_trigger_on_student” with “fourth_trigger_on_student”, and
see how they are different.
Removing triggers
Existing triggers can be deleted via the DROP TRIGGER command. For ex-
ample, the “first_trigger_on_student” trigger is removed from the STUDENT
table in the following way:
DROP TRIGGER first_trigger_on_student;
As studied in the earlier part of this chapter, the FOREIGN KEY constraint is
often used for ensuring the referential integrity among parent and child tables.
However, the FOREIGN KEY constraint can only enforce standard integrity
rules. They are:
• The foreign key column in the child table cannot reference non-existing
rows in the parent table.
• If the DELETE CASCADE option is not chosen, a row in the parent table
that is being referenced via a foreign key column cannot be deleted.
• If the DELETE CASCADE option is chosen, the row can be deleted to-
gether with all the rows in the child table which reference the parent row.
If other non-standard rules have to be enforced as well, then appropriate triggers
need to be created. Some possible non-standard rules are:
• Cascade updates.
• Set the foreign key column to NULL on updates and deletes.
• Set a default value to the foreign key column on updates and deletes.
The meanings of these rules have been explained before.
It must be emphasised that if triggers are used instead of the standard FOR-
EIGN KEY constraint, then for each of the integrity rules (standard and non-
standard), one or more triggers may need to be implemented. Also, the FOR-
EIGN KEY constraint must not be declared when creating the corresponding
tables. Otherwise, the triggers will not work, because the standard FOREIGN
KEY constraint will override the trigger actions.
In this section, we are going to see two examples of using triggers to implement
the DELETE CASCADE rule and the UPDATE CASCADE rule. The two
tables concerned are STUDENT and RECORD:
STUDENT(SID, SNAME, DNAME, SLEVEL, SEMAIL)
RECORD(SID, CODE, MARK)
We know that SID in RECORD is a foreign key linking to STUDENT.
To create the trigger to cascade deletes:
CREATE TRIGGER cascade_deletes_student_record
BEFORE DELETE ON STUDENT
FOR EACH ROW
BEGIN
DELETE FROM RECORD
WHERE RECORD.SID = :OLD.SID;
END;
It can be seen from the above example that, before the parent row is deleted from
the STUDENT table, all the child rows in the RECORD table are deleted. This
maintains the referential integrity. (In the PL/SQL code, :OLD.SID represents
the SID of the row in the STUDENT table which is to be deleted.)
To create the trigger to cascade updates:
CREATE TRIGGER cascade_updates_student_record
AFTER UPDATE OF SID ON STUDENT
FOR EACH ROW
BEGIN
UPDATE RECORD
SET RECORD.SID = :NEW.SID
WHERE RECORD.SID = :OLD.SID;
END;
Again, it can be seen from the example that, after the parent row is updated
in the STUDENT table, all the child rows in the RECORD table are updated
accordingly. This maintains the referential integrity.
Using triggers to maintain business rules
FOR EACH ROW
BEGIN
IF ((:NEW.MARK/:OLD.MARK) >= 1.1) OR ((:OLD.MARK/:NEW.MARK) >= 1.1)
THEN
RAISE_APPLICATION_ERROR(-20002, 'Warning: Large percentage change
in marks prohibited.');
END IF;
END;
The above two examples should have shown you what triggers are capable of
doing. In fact, using Oracle’s PL/SQL language, one can write much more
complex triggers to enforce various business rules. The discussion of PL/SQL is
beyond the scope of this module. Interested students are again advised to refer
to any book specifically dealing with Oracle PL/SQL programming for more
information on writing triggers.
Review question 7
1. When and how do we use triggers to maintain referential integrity?
2. How do we use triggers to implement business rules in the database?
Activity 7 - Creating triggers to prevent updates and deletions
In the university database, we can see that the rows in the DEPT table are often
referenced by many child rows in a number of other tables (e.g. STUDENT,
LECTURER and MODULE). Although there are FOREIGN KEY constraints
declared on the child tables to maintain the referential integrity, we can still
define a trigger in the parent table (i.e. DEPT) to stop any attempt to change
the name of the department and/or to remove any of the DEPT rows. This
corresponds to the business rule stating that once a university department is
established, it will be there ‘forever’ and will not be allowed to change its name
(we assume that such a rule is necessary).
In this activity, write appropriate PL/SQL statements to create the trigger.
After the trigger is created, try to change the name of some departments in
DEPT and delete a row from DEPT, and see what happens. Find out how the
trigger works.
Note that in order to execute PL/SQL statements in some DBMSs (including
Oracle’s SQL*PLUS), you may need to end the block of statements with a ‘/’.
Consult your DBMS’s documentation for any additional syntax requirements.
Activity 8 - Creating triggers to maintain data validity
In Activity 5, we have declared a CHECK constraint on MARK in the RECORD
table. Any mark that is not between 0 and 100 will cause a violation of the
constraint. Applying the CHECK constraint, however, we would not know
whether the mark is greater than 100 or smaller than 0 (i.e. a negative number).
In this activity, we will create a trigger to replace the original CHECK constraint,
which can tell us how the restriction on MARK is violated. Whenever the value
of MARK is beyond the valid range (0 – 100), the trigger will generate an error
message informing users whether it is greater than 100 or a negative number.
Write proper PL/SQL statements to create the trigger, and use some SQL UP-
DATE and INSERT statements to test it. Remember we need to drop the
CHECK constraint first, otherwise it will override any other triggers on MARK.
Activity 9 - Creating triggers to validate new column values
In the STUDENT table, the column SLEVEL can only take value 0, 1, 2, 3 or
NULL. Any other value is illegal. In order to ensure that only valid values are
stored in SLEVEL, we can create a trigger to automatically validate any new
value to be updated or inserted. The rules are:
• If a new value is smaller than 0, then set it to 0 before updating or inserting
it.
• An appropriate message is always displayed on the screen to inform the
user that a proper validation has been carried out.
Write proper PL/SQL statements to create the trigger, and use some SQL UP-
DATE and INSERT statements to test it.
Note that the Oracle’s DBMS_OUTPUT.PUT_LINE procedure can be used
to display text on screen. The basic syntax is:
DBMS_OUTPUT.PUT_LINE ('the text to be displayed in single quotes');
Remember that for Oracle, in order to use DBMS_OUTPUT.PUT_LINE to
display text on screen, you need to execute the command “SET SERVEROUT-
PUT ON” once in the SQL*PLUS environment.
The rest of this chapter deals with some additional features that are present in
Oracle. In other advanced DBMSs, similar features are also available. Again,
please consult your DBMS’s documentation to find out whether or not it sup-
ports the following features.
Stored procedures
Stored procedures differ from triggers in that triggers are
procedures associated with tables and invoked (called upon) by pre-specified
events, whereas stored procedures are called explicitly by applications or users.
Stored procedures, containing SQL or PL/SQL statements, allow one to move
code that enforces business rules from the application to the database. As a
result, the code can be stored once for use by different applications. Also, the
use of stored procedures can make one’s application code more consistent and
easier to maintain. This principle is similar to the good practice in general
programming in which common functionality should be coded separately as
procedures or functions.
Some of the most important advantages of using stored procedures are sum-
marised as follows:
• Because the processing of complex business rules can be performed within
the database, significant performance improvement can be obtained in
a networked client-server environment (refer to client-server chapters for
more information).
• Since the procedural code is stored within the database and is fairly static,
applications may benefit from the reuse of the same queries within the
database. For example, the second time a procedure is executed, the
DBMS may be able to take advantage of the parsing that was previously
performed, improving the performance of the procedure’s execution.
• Consolidating business rules within the database means they no longer
need to be written into each application, saving time during application
creation and simplifying the maintenance process. In other words, there
is no need to reinvent the wheel in individual applications, when the rules
are available in the form of procedures.
In the example, DBMS_OUTPUT is the package containing a number of pro-
cedures relating to displaying messages on the screen. PUT_LINE is a proce-
dure within the DBMS_OUTPUT package that can take a string of characters
(i.e. the text in the single quotes) and output it to the screen.
Note that in order to use Oracle’s DBMS_OUTPUT.PUT_LINE procedure to
display text on screen, you need to execute the command SET SERVEROUT-
PUT ON once in the SQL*PLUS environment.
Creating procedures
Creating functions
Calling a procedure from within a function and vice versa
Discussion topics
Having studied this chapter, we should have obtained a fair amount of knowledge
about declarative constraints, database triggers and stored procedures. We have
seen in Oracle that PL/SQL is a very useful as well as powerful language for
creating triggers to enforce business rules.
Now use any database application with which you may be familiar (e.g. for a
bank, a car-rental company, etc) to discuss in general what kind of application
logic and/or business rules should be implemented in the database using con-
straints and triggers. The objective of this discussion is to help you understand
further the usefulness and benefits of the techniques.
In this chapter, we have studied the five declarative constraints and the mecha-
nisms for creating database triggers and procedures in more advanced DBMSs
such as Oracle. We have seen that the constraints, database triggers and proce-
dures can be effectively used to incorporate application logic and enforce busi-
ness rules in databases.
A number of examples have been provided in this chapter. As additional ac-
tivities, you may wish to try out those examples, using the university database
that has been created during previous activities.
We have also seen in this chapter that the PL/SQL language of Oracle plays
a very important role in creating database triggers and procedures. In fact,
PL/SQL is a powerful language in Oracle that enables us to construct flexible
triggers and procedures to deal with various complex application issues. Al-
though we were not able to cover PL/SQL to a sufficient extent, interested
students who want to develop further knowledge on PL/SQL are advised to
read relevant books.
Chapter 11. File Organisation and Indexes
Table of contents
• Objectives
• Introduction
• Context
• Organising files and records on disk
– Record and record type
– Fixed-length and variable-length records in files
– Allocating records to blocks
– File headers
– Operations on files
• File organisations - organising records in files
– Heap file organisation
– Sorted sequential file organisation
∗ Binary search algorithm
∗ Performance issues
– Hash file organisation
∗ Hashing techniques
∗ External hashing
∗ Dynamic hashing
∗ Performance issues
• Single-level ordered indexes
– Primary indexes
∗ Performance issues
– Clustering indexes
∗ Performance issues
– Secondary indexes
∗ Index on key field
∗ Performance issues
∗ Index on a non-key field
– Summary of single-level ordered indexes
• Multilevel indexes
– The principle
– The structure
– Performance issues
• Dynamic multilevel indexes using B-trees and B+ trees
– The tree data structure
– Search trees
∗ Definition of a search tree
– B-trees: Balanced trees
∗ Definition of a B-tree
∗ Performance issues
– B+ trees
∗ Definition of a B+ tree
∗ Performance issues
∗ Search, insertion and deletion with B+ trees
∗ Dealing with overflow
∗ Dealing with underflow
– B* tree: A variation of B-tree and B+ tree
– Summary
Objectives
Introduction
In parallel with this chapter, you should read Chapter 16 and Chapter 17 of
Ramez Elmasri and Shamkant B. Navathe, “Fundamentals of Database
Systems”, (7th edn.).
In this chapter, we will study how files and records can be placed on disks, and
the effective ways in which records can be organised in files. The three
file organisations we will learn in this chapter are heap file, sorted file and hash
file. The single goal in applying these various file organisation techniques is
to provide the database system with good performance.
It is not only important that a database is designed well, but part of the design
process also involves ensuring that the structure of the developed system is
efficient enough to satisfy users’ requirements now and into the future.
Database tuning involves techniques that can be used to improve performance.
It is an important and complex subject to which a number of chapters are de-
voted in this module. We will be looking into design issues of indexes, and
the appropriate ways of using indexes. Indexes play a similar role in database
systems as they do in books, in that they are used to speed up access to infor-
mation. File structures can be affected by different indexing techniques, and
they in turn will affect the performance of the databases.
It is worth emphasising again the importance of file organisations and their
related access methods. Tuning techniques can help improve performance, but
only to the extent that is allowed by a particular file organisation. For example,
if you have a Ford car, you may obtain a better performance if you have the
car’s engine tuned. However, no matter how hard you tune it, you will never be
able to get a Ferrari’s performance out of it.
Indexes can help database developers build efficient file structures and offer effec-
tive access methods. When properly used and tuned, the database performance
can be improved further. In fact, indexes are probably the single most impor-
tant mechanism explicitly available to database developers and administrators
for tuning the performance of a database.
The contents of this chapter are related to discussions on Database Administra-
tion and Further Performance Tuning Techniques.
Context
In this chapter, we will describe the techniques used to store large amounts of
structured data on disks. These techniques are important for database design-
ers, DBAs (database administrators) and implementers of a DBMS. Database
designers and DBAs must know the advantages and disadvantages of each stor-
age method in order to develop and operate a DBMS for a specific application.
Even an off-the-shelf DBMS will usually have different options available for or-
ganising the data, and the process of physical database design involves selecting
the most appropriate data organisation technique for the given set of application
requirements.
A typical database application will always need to access the database and
retrieve some data for processing. Whenever a certain portion of the data is
needed, it must be located on disk, loaded to main memory for processing, and
then written back to the disk if changes have been made.
The data stored on disk is organised as files of records. Each record is a collection
of data values that can be interpreted as facts about entities, their attributes
and relationships. Records should be stored on disk in a manner that makes it
possible to locate them efficiently whenever they are needed.
There are a number of commonly used file organisations which can determine
how the records of a file are physically placed on disk. In this chapter, we will
be discussing:
• Heap file: which places records on disk in no particular order.
• Sorted sequential file: which holds records in a particular order based
on the value of a specified field (i.e. attribute).
• Hashed file: which uses a hash function to decide where a record should
be placed on disk.
In this chapter, we will also introduce access structures called indexes, which
are used to speed up the retrieval of records if certain requirements on search
conditions are met. An index for a file of records works just like an index
catalogue in a library. In a normal library environment, for example, there
should be catalogues such as author indexes and book title indexes. A user may
use one of these indexes to quickly find the location of a required book, if he/she
knows the author(s) or title of the book.
Each index (access structure) offers an access path to records. Some types of
indexes, called secondary access paths, do not affect the physical placement of
records on disk; rather, they provide alternative search paths for locating the
records efficiently based on the indexing fields. Other types of indexes can only
be constructed on a file with a particular primary organisation.
The focus of our study in this chapter will be on the following:
• Primary indexes
• Clustering indexes
• Secondary indexes
• Multilevel indexes
• B-tree and B+ tree structures.
It must be emphasised that different indexes have their own advantages and
disadvantages. There is no universally efficient index. Each technique is best
suited for a particular type of database application.
The merits of indexes can be measured in the following aspects:
• Access types: The kind of access methods that can be supported effi-
ciently (e.g. value-based search or range search).
• Access time: Time needed to locate a particular record or a set of
records.
• Insertion efficiency: How efficient an insertion operation is.
• Deletion efficiency: How efficient a deletion operation is.
• Storage overhead: The additional storage requirement of an index struc-
ture.
It is worth noting that a file of records can have more than one index, just like
for books there can be different indexes such as author index and title index.
An index access structure is usually constructed on a single field of a record in
a file, called an indexing field. Such an index typically stores each value of the
indexing field, along with a list of pointers to all disk blocks that contain records
with that field value. The values in the index are usually sorted (ordered) so
that we can perform an efficient binary search on the index.
To give you an intuitive understanding of an index, we look at library indexes
again. For example, an author index in a library will have entries for all authors
whose books are stored in the library. AUTHOR is the indexing field and all
the names are sorted according to alphabetical order. For a particular author,
the index entry will contain locations (i.e. pointers) of all the books this author
has written. If you know the name of the author, you will be able to use this
index to find his/her books quickly. What happens if you do not have an index
to use? This is similar to using a heap file and linear search. You will have to
browse through the whole library looking for the book.
An index file is much smaller than the data file, and therefore searching the
index using a binary search can be carried out quickly. Multilevel indexing goes
one step further in the sense that it eliminates the need for a binary search, by
building indexes to the index itself. We will be discussing these techniques later
on in the chapter.
In this section, we will briefly define the concepts of records, record types and
files. Then we will discuss various techniques for organising file records on disk.
A record is the unit in which data is usually stored. Each record is a collection
of related data items, where each item is formed of one or more bytes and
corresponds to a particular field of the record. Records usually describe entities
and their attributes. A collection of field (item) names and their corresponding
data types constitutes a record type. In short, we may say that a record type
corresponds to an entity type and a record of a specific type represents an
instance of the corresponding entity type.
The following is an example of a record type and its record:
A record type: STUDENT(ID#, NAME, ADDRESS, COURSE, LEVEL)
A specific record of the STUDENT type:
STUDENT(9901536, “James Bond”, “1 Bond Street, London”, “Intelligent Ser-
vices”, 9)
A file basically contains a sequence of records. Usually all records in a file are
of the same record type. If every record in the file has the same size in bytes,
the records are called fixed-length records. If records in the file have different
sizes, they are called variable-length records.
Variable-length records may exist in a file for the following reasons:
• Although they may be of the same type, one or more of the fields may be
of varying length. For instance, students’ names are of different lengths.
• The records are of the same type, but one or more of the fields may be a
repeating field with multiple values.
• If one or more fields are optional, not all records (of the same type) will
have values for them.
• A file may contain records of different record types. In this case, records
in the file are likely to have different sizes.
For fixed-length records, the exact size of each record can be determined in
advance. As a result, they can easily be allocated to blocks (a block is the unit
of transfer of data between main memory and disk). Also, we can identify the
starting byte position of each field relative to the starting position of the record,
because each of such records has the same fields, and the field lengths are fixed
and known beforehand. This provides us with a simple way to find field values
of a record in the file.
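The fixed-offset idea above can be sketched in a few lines of Python. This is an illustration only, not any DBMS's actual storage format; the field names and sizes are borrowed from the STUDENT record used in Exercise 1 later in this chapter:

```python
# Fixed-length record layout: every field has a known size, so the byte
# offset of each field within a record can be computed in advance.
FIELDS = [("ID", 7), ("NAME", 35), ("ADDRESS", 46), ("COURSE", 12), ("LEVEL", 1)]

OFFSETS, pos = {}, 0
for name, size in FIELDS:
    OFFSETS[name] = (pos, size)   # starting byte and width of each field
    pos += size
RECORD_SIZE = pos                 # 7 + 35 + 46 + 12 + 1 = 101 bytes

def pack_record(values):
    # Pad (or truncate) each value to its fixed width and concatenate.
    return b"".join(v.encode().ljust(size)[:size]
                    for (name, size), v in zip(FIELDS, values))

def get_field(record, name):
    # The offset is known beforehand, so no scanning is needed.
    start, size = OFFSETS[name]
    return record[start:start + size].rstrip(b" ").decode()
```

Because every record has the same layout, `get_field` jumps straight to the right byte range; this is the "simple way to find field values" the text refers to.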
For records with variable-length fields, we may not know the exact lengths of
those fields in advance. In order to determine the bytes that are needed to
accommodate those fields, we can use special separator characters, which do
not appear in any field value (such as ~, @ or !), to terminate the variable-
length fields. An alternative to this approach is to store the exact length of a
variable-length field explicitly in the record concerned.
A repeating field needs one separator character to separate the repeating values
of the field, and another separator character to indicate termination of the field.
In short, we need to find out the exact size of a variable-length record before
allocating it to a block or blocks. It is also apparent that programs that process
files of variable-length records will be more complex than those for fixed-length
records, where the starting position and size of each field are known and fixed.
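The length-prefix alternative mentioned above can be sketched as follows. This is a minimal illustration of the idea, not any particular DBMS's record format:

```python
import struct

def encode_varlen(fields):
    """Store each variable-length field preceded by a 2-byte length,
    so no separator character is needed and any byte may appear in a value."""
    out = b""
    for f in fields:
        data = f.encode()
        out += struct.pack(">H", len(data)) + data
    return out

def decode_varlen(record):
    """Walk the record: read a length, then read that many bytes."""
    fields, pos = [], 0
    while pos < len(record):
        (n,) = struct.unpack_from(">H", record, pos)
        pos += 2
        fields.append(record[pos:pos + n].decode())
        pos += n
    return fields
```

Storing the explicit length makes the total record size computable before the record is allocated to a block, at the cost of the extra length bytes and more complex processing than fixed-length records require.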
We have seen that fixed-length records have advantages over variable-length
records with respect to storage and retrieving a field value within the record.
In some cases, therefore, it is possible and may also be advantageous to use
a fixed-length record structure to represent a record that may logically be of
variable length.
For example, we can use a fixed-length record structure that is large enough to
accommodate the largest variable-length record anticipated in the file. For a
repeating field, we could allocate as many spaces in each record as the maximum
number of values that the field can take. In the case of optional fields, we may
have every field included in every file record. If an optional field is not applicable
to a certain record, a special null value is stored in the field. By adopting such
an approach, however, it is likely that a large amount of space will be wasted
in exchange for easier storage and retrieval.
The records of a file must be allocated to disk blocks because a block is the
unit of data transfer between disk and main memory. When the record size is
smaller than the block size, a block can accommodate many such records. If
a record has too large a size to be fitted in one block, two or more blocks will
have to be used.
In order to enable further discussions, suppose the size of the block is B bytes,
and a file contains fixed-length records of size R bytes. If B ≥ R, then we can
allocate bfr = ⌊B/R⌋ records into one block, where ⌊x⌋ is the so-called
floor function, which rounds the value x down to the nearest integer. The value bfr
is defined as the blocking factor for the file.
In general, R may not divide B exactly, so there will be some leftover space in
each block, equal to B − (bfr * R) bytes.
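The floor calculation can be checked with a couple of lines of Python (the block and record sizes here are arbitrary illustrative values):

```python
B, R = 512, 101          # block size and fixed record size, in bytes
bfr = B // R             # blocking factor: floor(B / R)
leftover = B - bfr * R   # unused bytes per block under unspanned organisation
```

With these values, five 101-byte records fit into each 512-byte block, leaving 7 bytes unused per block.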
If we do not want to waste the unused spaces in the blocks, we may choose to
store part of a record in them and the rest of the record in another block. A
pointer at the end of the first block points to the block containing the other
part of the record, in case it is not the next consecutive block on disk. This
organisation is called ‘spanned’, because records can span more than one block.
If records are not allowed to cross block boundaries, the organisation is called
‘unspanned’.
Unspanned organisation is useful for fixed-length records with a length R ≤
B. It makes each record start at a known location in the block, simplifying
record processing. For variable-length records, either a spanned or unspanned
organisation can be used. It is normally advantageous to use spanning to reduce
the wasted space in each block.
For variable-length records using spanned organisation, each block may store
a different number of records. In this case, the blocking factor bfr represents
the average number of records per block for the file. We can then use bfr to
calculate the number of blocks (b) needed to accommodate a file of r records:
b = ⌈r/bfr⌉ blocks
where ⌈x⌉ is the so-called ceiling function, which rounds the value x up to
the nearest integer.
It is not difficult to see that if the record size R is bigger than the block size B,
then spanned organisation has to be used.
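The ceiling calculation can be sketched similarly (the record count and blocking factor below are arbitrary illustrative values):

```python
import math

r, bfr = 20000, 6           # number of records and blocking factor
b = math.ceil(r / bfr)      # blocks needed: ceil(r / bfr)

# Equivalent integer-only form, avoiding floating point:
b_int = -(-r // bfr)
```

Here 20,000 records at 6 records per block need 3,334 blocks: 3,333 full blocks plus one final, partially filled block.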
File headers
Operations on files
Operations on files can usually be grouped into retrieval operations and update
operations. The former do not change anything in the file, but only locate
certain records for further processing. The latter change the file by inserting or
deleting or modifying some records.
Typically, a DBMS can issue requests to carry out the following operations (with
assistance from the operating-system file/disk managers):
• Find (or Locate): Searches for the first record satisfying a search con-
dition (a condition specifying the criteria that the desired records must
satisfy). Transfers the block containing that record into a buffer (if it is
not already in main memory). The record is located in the buffer and
becomes the current record (ready to be processed).
• Read (or Get): Copies the current record from the buffer to a program
variable. This command may also advance the current record pointer to
the next record in the file.
• FindNext: Searches for the next record in the file that satisfies the search
condition. Transfers the block containing that record into a buffer, and
the record becomes the current record.
• Delete: Deletes the current record and updates the file on disk to reflect
the change requested.
• Modify: Modifies some field values for the current record and updates
the file on disk to reflect the modification.
• Insert: Inserts a new record in the file by locating the block where the
record is to be inserted, transferring that block into a buffer, writing the
(new) record into the buffer, and writing the buffer to the disk file to reflect
the insertion.
• FindAll: Locates all the records in the file that satisfy a search condition.
• FindOrdered: Retrieves all the records in the file in a specified order.
• Reorganise: Rearranges records in a file according to certain criteria.
An example is the ‘sort’ operation, which organises records according to
the values of specified field(s).
• Open: Prepares a file for access by retrieving the file header and preparing
buffers for subsequent file operations.
• Close: Signals the end of using a file.
Before we move on, two concepts must be clarified:
• File organisation: This concept generally refers to the organisation of
data into records, blocks and access structures. It includes the way in
which records and blocks are placed on disk and interlinked. Access struc-
tures are particularly important. They determine how records in a file
are interlinked logically as well as physically, and therefore dictate what
access methods may be used.
• Access method: This consists of a set of programs that allow operations
to be performed on a file. Some access methods can only be applied to
files organised in certain ways. For example, indexed access methods can
only be used in indexed files.
In the following sections, we are going to study three file organisations, namely
heap files, sorted files and hash files, and their related access methods.
Review question 1
1. What are the different reasons for having variable-length records?
2. How can we determine the sizes of variable-length records with variable-
length fields when allocating them to disk?
3. When is it most useful to use fixed-length representations for a variable-
length record?
4. What information is stored in file headers?
5. What is the difference between a file organisation and an access method?
The heap file organisation is the simplest and most basic type of organisation.
In such an organisation, records are stored in the file in the order in which they
are inserted, and new records are always placed at the end of the file.
The insertion of a new record is very efficient. It is performed in the following
steps:
• The last disk block of the file is copied into a buffer.
• The new record is added.
• The block in the buffer is then rewritten back to the disk.
Remember the address of the last file block can always be kept in the file header.
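The insertion steps above can be sketched as a small in-memory model. This is an illustration of the principle only (real heap files manipulate disk blocks via buffers; here a block is just a Python list):

```python
BLOCK_SIZE = 4  # records per block (the blocking factor), kept tiny here

class HeapFile:
    """Minimal sketch of a heap file: new records always go into the
    last block, whose address is remembered in the 'file header'."""
    def __init__(self):
        self.blocks = [[]]   # list of blocks, each a list of records
        self.last = 0        # file header keeps the last block's address

    def insert(self, record):
        # 1. copy the last block into a buffer; 2. add the new record;
        # 3. rewrite the block (done in place in this in-memory model).
        buf = self.blocks[self.last]
        if len(buf) == BLOCK_SIZE:       # last block full: start a new one
            self.blocks.append([])
            self.last += 1
            buf = self.blocks[self.last]
        buf.append(record)
```

Because the last block's address is always at hand, insertion touches exactly one block regardless of the file's size, which is why heap insertion is so efficient.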
The search for a record based on a search condition involves a linear search
through the file block by block, which is often a very inefficient process. If only
one record satisfies the condition then, on average, half of the file blocks will
have to be transferred into main memory before the desired record is found. If
no records or several records satisfy the search condition, all blocks will have to
be transferred.
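The "half of the file blocks on average" claim can be verified directly, assuming the single matching record is equally likely to be in any block:

```python
b = 1000  # number of blocks in the file

# A linear search reads i + 1 blocks when the match is in block i,
# so the expected cost over all equally likely positions is:
avg = sum(i + 1 for i in range(b)) / b

# This equals (b + 1) / 2, i.e. about half the blocks in the file.
```

So for a 1,000-block heap file, an average successful search transfers 500.5 blocks, while an unsuccessful search always transfers all 1,000.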
To modify a record in a file, a program must:
• find it;
• transfer the block containing the record into a buffer;
• make the modifications in the buffer;
• then rewrite the block back to the disk.
As we have seen, the process of finding the record can be time-consuming.
To remove a record from a file, a program must:
• find it;
• transfer the block containing the record into a buffer;
• delete the record from the buffer;
• then rewrite the block back to the disk.
Again, the process of finding the record can be time-consuming.
Physical deletion of a record leaves unused space in the block. As a consequence,
a large amount of space may be wasted if frequent deletions have taken place. An
alternative method to physically deleting a record is to use an extra bit (called
a deletion marker) in all records. A record is deleted (logically) by setting the
deletion marker to a particular value. A different value of the marker indicates a
valid record (i.e. not deleted). Search programs will only consider valid records in
a block. Invalid records will be ignored as if they have been physically removed.
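The deletion-marker technique can be sketched as follows (names and the marker encoding are illustrative; a real DBMS would store the marker as a byte or bit inside each on-disk record):

```python
# Each record carries a deletion marker; searches skip records whose
# marker is set, as if they had been physically removed.
VALID, DELETED = 0, 1

records = [[VALID, "Alice"], [VALID, "Bob"], [VALID, "Carol"]]

def delete(i):
    records[i][0] = DELETED   # logical deletion: just flip the marker

def search(name):
    # Only valid (not logically deleted) records are considered.
    return [r for r in records if r[0] == VALID and r[1] == name]

delete(1)  # 'Bob' is now invisible to searches, but still occupies space
```

The flip is cheap, but the space is not reclaimed until the file is reorganised, which is the trade-off the following paragraph discusses.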
No matter what deletion technique is used, a heap file will require regular reor-
ganisation to reclaim the unused spaces due to record deletions. During such
reorganisation, the file blocks are accessed consecutively and some records may
need to be relocated to fill the unused spaces. An alternative approach to re-
organisation is to use the space of deleted records when inserting new records.
However, this alternative will require extra facilities to keep track of empty
locations.
Both spanned and unspanned organisations can be used for a heap file of ei-
ther fixed-length records or variable-length records. Modifying a variable-length
record may require deleting the old record and inserting a new record incorpo-
rating the required changes, because the modified record may not fit in its old
position on disk.
Exercise 1
A file has r = 20,000 STUDENT records of fixed length, each record has
the following fields: ID# (7 bytes), NAME (35 bytes), ADDRESS (46 bytes),
COURSE (12 bytes), and LEVEL (1 byte). An additional byte is used as a dele-
tion marker. This file is stored on the disk with the following characteristics:
block size B = 512 bytes; inter-block gap size G = 128 bytes; number of blocks
per track = 20; number of tracks per surface = 400. Do the following exercises:
1. Calculate the record size R.
2. Calculate the blocking factor bfr.
3. Calculate the number of file blocks b needed to store the records, assuming
an unspanned organisation.
Exercise 2
Suppose we now have a new disk device which has the following specifications:
block size B = 2400 bytes; inter-block gap size G = 600 bytes. The STUDENT
file in Exercise 1 is stored on this disk, using unspanned organisation. Do the
following exercises:
1. Calculate the blocking factor bfr.
2. Calculate the wasted space in each disk block because of the unspanned
organisation.
3. Calculate the number of file blocks b needed to store the records.
4. Calculate the average number of block accesses needed to search for an
arbitrary record in the file, using linear search.
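If you want to check your answers after attempting both exercises, the calculations follow directly from the standard blocking formulas, sketched below (the function names are mine, not the text's). Note that the inter-block gap G affects how many blocks fit on a track, but not the blocking factor, which depends only on B and R.

```python
import math

def record_size(field_sizes, marker_bytes=1):
    """R = sum of the field sizes plus the deletion marker."""
    return sum(field_sizes) + marker_bytes

def blocking_factor(B, R):
    """bfr = floor(B / R): whole records per block, unspanned."""
    return B // R

def file_blocks(r, bfr):
    """b = ceil(r / bfr): blocks needed for r records."""
    return math.ceil(r / bfr)

def wasted_per_block(B, R, bfr):
    """Bytes left unused in each block under unspanned organisation."""
    return B - bfr * R

# Exercise 1: B = 512 bytes
R = record_size([7, 35, 46, 12, 1])      # ID#, NAME, ADDRESS, COURSE, LEVEL
bfr1 = blocking_factor(512, R)
b1 = file_blocks(20000, bfr1)

# Exercise 2: B = 2400 bytes
bfr2 = blocking_factor(2400, R)
waste = wasted_per_block(2400, R, bfr2)
b2 = file_blocks(20000, bfr2)
avg_linear = b2 / 2                      # linear search reads b/2 blocks on average
```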
Review question 2
Discuss the different techniques for record deletion.
Sorted files
Records in a file can be physically ordered on the values of one of their fields.
Such a file organisation is called a sorted file, and the field used is called
the ordering field. If the ordering field is also a key field of the file, it is
called the ordering key for the file.
The following figure depicts a sorted file organisation containing the STUDENT
records:
The sorted file organisation has some advantages over unordered files, such as:
• Reading the records in order of the ordering field values becomes very
efficient, because no sorting is required. (Remember, one of the common
file operations is FindOrdered.)
• Locating the next record from the current one in order of the ordering field
usually requires no additional block accesses, because the next record is
often stored in the same block (unless the current record is the last one in
the block).
• Retrieval using a search condition based on the value of the ordering field
can be efficient when the binary search technique is used.
In general, the search condition must involve the ordering field for these
advantages to be obtained.
Explanations:
• The binary search algorithm begins with the middle block of the file,
which is loaded into a buffer.
• The specified ordering key value K is then compared with the ordering
field values of the first and last records in the buffer.
• If K is smaller than the ordering field value of the first record, the
desired record must be in the first half of the file (if it is in the file
at all). In this case, the binary search continues on the first half of
the file, and the blocks in the second half can be ignored.
• If K is bigger than the ordering field value of the last record, the
desired record must be in the second half of the file (if it is in the
file at all). In this case, the binary search continues on the second half
of the file, and the blocks in the first half can be ignored.
• If K lies between the ordering field values of the first and last records,
then the record, if it is in the file at all, must be in the block already
in the buffer; if it is not found there, it is not in the file.
Referring to the example in the figure above, suppose we want to find a student’s
record whose ID number is 9701890. We further assume that there are five blocks
in the file (i.e. the five blocks shown in the figure). Using the binary search, we
start in block 3 and find that 9701890 (the specified ordering field value) is
smaller than 9703501 (the ordering field value of the first record). Thus, we
move to block 2 and read it into the buffer. Again, we find that 9701890 is
smaller than 9702381 (the ordering field value of the first record of block 2). As
a result, we read in block 1 and find 9701890 is between 9701654 and 9702317.
If the record is in the file, it has to be in this block. By conducting a further
search in the buffer, we can find the record.
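The block-level binary search described above can be sketched as follows. This is a simplified in-memory model: each block is a list of records (dicts) sorted on an assumed ordering key `id`, and the blocks are in ordering-key sequence; real systems work with disk blocks and buffers rather than Python lists.

```python
def binary_search_file(blocks, key):
    """Block-level binary search on a sorted file.

    `blocks` is a list of blocks, each a list of records sorted on
    the ordering key 'id', with the blocks in key order.
    Returns the matching record, or None if it is not in the file.
    """
    low, high = 0, len(blocks) - 1
    while low <= high:
        mid = (low + high) // 2
        block = blocks[mid]                  # "load the middle block"
        if key < block[0]["id"]:             # before the first record
            high = mid - 1                   # continue on the first half
        elif key > block[-1]["id"]:          # after the last record
            low = mid + 1                    # continue on the second half
        else:
            # key lies within this block's range: search inside the buffer
            for rec in block:
                if rec["id"] == key:
                    return rec
            return None                      # not in the file at all
    return None
```

Replaying the worked example with five blocks locates ID 9701890 after inspecting only a couple of blocks, rather than the b/2 blocks a linear search would read on average. (The exact sequence of blocks visited depends on how the midpoint is rounded, so it may differ slightly from the narrative above.)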
Performance issues
The sorted file organisation can offer very efficient retrieval performance only if
the search is based o