Chapter 2
Databases & Data Warehouses
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1
Outline
Database Concepts
Steps in Database Design
Entity-Relationship Model
Logical Database Design
Relational Model
Object-Relational Model
Physical Database Design
Data Warehouse Concepts
Logical Data Warehouse Design
Physical Data Warehouse Design
Data Warehouse Architecture
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 2
Steps in Database Design
Requirements specification: Collects information
about users’ needs with respect to the database
system
Conceptual design: Builds a user-oriented
representation of the database without any
implementation considerations
Logical design: Translates the conceptual schema
from the previous phase into an implementation
model common to several DBMSs, e.g., relational or
object-relational
Physical design: Customizes the logical schema from
the previous phase to a particular platform, e.g.,
Oracle or SQL Server
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 3
Entity Relationship Model: University
Example
Academic staff Participates Project
(0,n) (1,n)
Employee no Start date Project id
Name End date Project acronym
First name Project name
Last name Description (0,1)
Address Start date
Email End date
Homepage Budget
Research areas (1,n) Funding agency
(partial,exclusive) Teaches
(1,1) (0,n)
(0,n) (0,n)
Assistant Professor Section Prerequisite
Thesis title Status Semester Has prereq Is a prereq
Thesis description Tenure date (0,1) Year
/No courses Homepage (0,1) Course
(1,1)
(1,1) (0,n) Course id
(1,n)
Advisor CouSec Course name
Level
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 4
ER Model: Entity and Relationship Types
(1)
Entity Relationship Model: Most often used
conceptual model for database design
Entity type: Represents a set of real-world objects of
interest to an application
Entity: object belonging to an entity type
Relationship type: Represents an association
between objects
Relationship: association belonging to a relationship type
Role: Participation of an entity type in a relationship
type
Cardinalities in a role: Minimum and maximum
number of times that the entity may participate in
the relationship type
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 5
ER Model: Entity and Relationship Types
(2)
Types of roles
Optional vs mandatory: minimum cardinality 0 vs 1
Monovalued vs multivalued: maximum cardinality 1 vs n
Relationship types
Binary vs n-ary: two vs more than two participating entity
types
Binary relationship types:
One-to-one, one-to-many, or many-to-many depending on
the maximum cardinality of the two roles
Recursive relationship types: the same entity type
participates more than once in the relationship type
Role names are then necessary to distinguish different roles
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 6
ER Model: Attributes
Attributes: Structural characteristics describing
objects and relationships
Attributes have cardinalities, as for roles
Depending on cardinalities, attributes may be
Optional vs mandatory
Monovalued vs multivalued
Complex attributes: Attributes composed of other
attributes
Derived attributes: Their value for each instance may
be calculated from other elements of the schema
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 7
ER Model: Identifiers and Weak Entity
Types
Identifier or key: One or several attributes that uniquely
identify a particular object
Weak entity type: Do not have identifier of their own
Regular entity types are called strong entity types
A weak entity type is dependent on the existence of
another entity type, called the identifying or owner
entity type
An identifying relationship type relates a weak entity
type to its owner
Normal relationship types are called regular relationship types
Partial key: Set of attributes that can uniquely identify
weak entities related to the same owner entity
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 8
ER Model: Generalization (1)
Generalization or is-a relationship: Relates
perspectives of the same concept at different
abstraction levels
Supertype: Entity at higher abstraction level
Subtype: Entity at lower abstraction level
Population inclusion: Every instance of the subtype is
also an instance of the supertype
Inheritance: All characteristics (e.g., attributes, roles)
of the supertype are inherited by the subtype
Substitutability: Each time an instance of a
supertype is required, an instance of the subtype can
be used instead
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 9
ER Model: Generalization (2)
Several generalization relationships accounting for
the same criterion can be grouped
Types of generalizations
Total vs partial: depending on whether every instance of the
supertype is also an instance of one of the subtypes
Disjoint vs overlapping: depending on whether an instance
may belong to one or several subtypes
Multiple inheritance: A subtype that has several
supertypes
May induce conflicts when a property of the same name is
inherited from two supertypes
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 10
ER Model: Summary of Notation
Simple attribute
(0,1) Complex attribute
Professor Teaches role name Attribute 1
(1,1)
Attribute 2
Identifier (1,1) (0,n) Optional att. (0,1)
Attributes
Other attributes (1,n) Multivalued attr. (1,n)
Entity Type Relationship Type Role Cardinalities Attribute types
Section
(total,exclusive)
CouSec (partial,exclusive)
Partial Identifier
Other Attributes (total,overlapping)
(partial,overlapping)
Identifying
Weak Entity Type Relationship Type Generalization Generalization types
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 11
Relational Model: University Example
Academic staff Academic staff Project
Research area
Employee no Employee no Project id
First name Research area Project acronym
Last name Project name
Address Description (0,1)
Email Participates Start date
Homepage End date
Employee no Budget
Project id Funding agency
Professor Start date
End date
Employee no
Status Course
Tenure date (0,1) Section
No courses Course id
Course id Course name
Semester Level
Assistant Year
Homepage (0,1)
Employee no Employee no Prerequisite
Thesis title
Thesis description Course id
Advisor Has prereq
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 12
Relational Model (1)
Based on a simple data structure
A relation or table, composed of attributes or columns
Each attribute is defined over a domain or data type
Typical domains: integer, float, date, string, …
Tuple or row: element of the table
Provides several declarative integrity constraints
Not null attribute: a value for the attribute must be
provided
Key: column(s) that uniquely identify one row of the table
Simple vs composite key: depending on whether key is composed
of one or several columns
Primary vs alternate key: one of the keys must be chosen as
primary by the designer, the others are alternate keys
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 13
Relational Model (2)
Referential integrity
Defines a link between two tables (or twice the same one)
A set of attributes in one table (foreign key) references the
primary key of the other table
Values in the foreign key columns must also exist in the
primary key of the referenced table
Check constraint: defines a predicate that must be
valid when inserting or updating a tuple in a relation
Predicate can only involve the tuple being inserted/deleted
All other constraints must be expressed by triggers:
an event-condition-action rule that is automatically
activated when a relation is updated
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 14
Translating from ER to Relational
Schemas (1)
Rule 1: A strong entity type is associated with a table
Table contains the simple monovalued attributes and the
simple component of the monovalued complex attributes of
the entity type
Table also defines not null constraints for mandatory
attributes
Identifier of the entity
Academic staff type define the primary
Academic staff key of the
table Employee no Employee no
Name First name
First name Last name
Last name Address
Address Email
Email Homepage
Homepage
Research areas (1,n)
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 15
Translating from ER to Relational
Schemas (2)
Rule 2: A weak entity type is transformed as a strong
entity type, but the associated table also contains
the identifier of the owner entity
A referential integrity constraint relates this identifier to the
table associated with the owner entity type
Primary key of the table: partial identifier of the weak entity
type + identifier of the owner entity type
Section CouSec Course Section
(1,1) (1,n)
Semester Course id Course id
Year ... Semester
... Year
...
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 16
Translating from ER to Relational
Schemas (3)
Rule 3: A binary one-to-one relationship type is
represented by including the identifier of one of the
participating entity types in the table associated to
the other entity type
A referential integrity constraint relates this identifier to the
table associated with the corresponding entity type
Simple monovalued attributes and simple components of
monovalued complex attributes are also included in the
table
Not null constraint for mandatory attributes are also defined
Department Dean Professor Department
(1,1) (0,1)
... ... ...
Dean
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 17
Translating from ER to Relational
Schemas (4)
Rule 4: A binary one-to-many relationship type is
represented by including the identifier of the entity
type with cardinality 1 in the table associated to the
other entity type
A corresponding referential integrity constraint is added
Attributes and not null constraints are also added as in
previous rules
Assistant Advisor Professor Assistant
(1,1) (0,n)
... ... ...
Advisor
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 18
Translating from ER to Relational
Schemas (5)
Rule 5: A binary many-to-many relationship type or
an n-ary relationship type is associated with a table
containing the identifiers of the entity types that
participate in all its roles
Corresponding referential integrity constraints are added
Attributes and not null constraints are also added as in
previous rules
Identifier if any is the key of the relation, but the
combination
Academic staff ofParticipates
all role identifiersProject
can be used as a key
Participates
(0,n) (1,n)
Employee no Start date Project id Employee no
... End date ... Project id
Start date
End date
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 19
Translating from ER to Relational
Schemas (6)
Rule 6: A multivalued attribute is associated with a
table containing the identifier of the entity or
relationship type to which it belongs
Corresponding referential integrity constraint is added
Primary key of table is composed of all its attributes
Academic staff Academic staff
Research area
... Employee no
Research areas (1,n) Research area
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 20
Translating from ER to Relational
Schemas (7)
Rule 7: A generalization can be translated in 3 ways
The supertype and the subtype are associated with tables
containing the identifier of the entity type. There is a
referential integrity constraint from the table of the subtype
to the table of the subtype
Academic staff Academic staff
Employee no Employee no
Name Name
... ...
Assistant Professor
Assistant Professor
Employee no Employee no
Thesis title Status Thesis title Status
... ... ... ...
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 21
Translating from ER to Relational
Schemas (8)
Rule 7 (cont.):
Only the supertype is associated with a table, which
contains also the attributes of the subtype (these are
optional attributes) Academic staff
Employee no
Name
…
Thesis title (0,1)
...
Status
...
Only the subtype is associated with a table containing all
attributes of the subtype and the supertype
Assistant Professor
Employee no Employee no
Name Name
… …
Thesis title Status
... ...
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 22
Defining Relational Schemas in SQL
create table AcademicStaff as (
EmployeeNo integer primary key,
FirstName character varying (30) not null,
LastName character varying (30) not null,
Address character varying (50) not null,
Email character varying (30) not null,
Homepage character varying (64) not null );
create table AcademicStaffResearchArea as (
EmployeeNo integer not null,
ResearchArea character varying (30) not null,
primary key (EmployeeNo,ResearchArea),
foreign key EmployeeNo references
AcademicStaff(EmployeeNo) );
…
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 23
Redundancies and Update Anomalies
Participates Affiliation
Employee no Employee no
Project id Research area
Start date Department id
End date
Project acronym
Project name
Relation with redundancies may induce anomalies in
the presence of insertions, updates, and deletions
In relation Participates the information about a project is
repeated for each staff member who works on that project
In relation Affiliation the information about a research area
is repeated for all departments to which the staff member is
affiliated
Dependencies and normal forms are used to
describe such anomalies
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 24
Functional Dependencies
Given a relation R and to sets of attributes X and Y in
R, a functional dependency X Y holds if in all
tuples of the relation, each value of X is associated
with at most one value of Y
In Participates above, Project Id {Project acronym, Project
name}
A key is a particular case of a functional dependency
A prime attribute is an attribute that is part of a key
A full functional dependency is a dependency X Y
in which the removal of an attribute from X
invalidates the dependency
A dependency X Z is transitive if there is a set of
attributes Y such that X Y and Y Z hold
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 25
Multivalued Dependencies
Given a relation R and to sets of attributes X and Y in
R, a multivalued dependency X Y holds if the
value of X determines a set of values for Y,
independently of any other attributes
In Affiliation above, Employee no Research area and
Employee no Department Id
A multivalued dependency X Y is trivial if either Y
X or X Y = R, otherwise it is nontrivial
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 26
Normal Forms (1)
Normal Form: Integrity constraint certifying that a
relation schema satisfies particular properties
In the relational model the relations are in first
normal form since all attributes are atomic and
monovalues
A relation schema is in the second normal form if
every nonprime attribute is functionally dependent
on every key
Participates Participates Project
Employee no Employee no Project id
Project id Project id Project acronym
Start date Start date Project name
End date End date
Project acronym
Project name
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 27
Normal Forms (2)
A relation is in the third normal form if it is in the
second normal form and there are no transitive
dependencies between a key and a nonprime
attribute
Assistant Assistant Professor
Employee no Employee no Employee no
Thesis title Thesis title First name
Thesis description Thesis description Last name
Advisor id Advisor id
Advisor first name
Advisor last name
A relation is in the Boyce-Codd normal form if for
every Participates
nontrivial dependency X Y, X is Location
Participates a key or
contains
Employeea no key Employee no Location
Project id Project id Project id
Start date Start date
End date End date
Location
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 28
Normal Forms (3)
A relation is in the fourth normal form if for every
nontrivial dependency X Y, X is a key or contains
a key
Affiliation Affiliation Research
Employee no Employee no Employee no
Research area Research area Research area
Department id
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 29
Weaknesses of the Relational Model
It provides a very simple data structure, which
disallows multivalued and complex attributes
complex objects are split into several tables performance
problems
The set of types provided by relational DBMSs is very
restrictive (integer, float, string, date, …)
Insufficient for complex application domains
There is no integration of operations with data
structures
It is not possible to directly reference an object by
use of a surrogate or a pointer
links are based on comparison of values performance
problems
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 30
Object-Relational Model: Motivations
Preserves the foundations of the relational model
Organizes data using an object model
Allow attributes to have complex types groups related
facts into a single row
Extensions
Complex and/or multivalued attributes
User-defined types with associated methods
System-generated identifiers
Inheritance among types
Defined in the SQL:1999 and SQL:2003 standards
Current DBMSs (Oracle, …) introduced object-
relational extensions, but do not necessarily comply
with the standard
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 31
Object-Relational Model: Definition (1)
Allow composite types, in addition to predefined
types
Complex values may be stored in a column of a table
Row type: structured values (i.e., composite attributes)
Array type: variable-sized vectors of values
Multiset type: Unordered collection of values, with duplicates
permitted (bags)
Composite types can be nested
This is considered an “advanced feature” in the standard
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 32
Object-Relational Model: Definition (2)
Two sorts of user-defined types
Distinct types: Represented internally by a predefined
type, but cannot be mixed in expressions
Does not make sense to compare a person’s name and a SSN
Structured user-defined types: class declarations in OO
languages
May have attributes, which can be of any SQL type
May have methods, which can be instance or static (class)
methods
Attributes are encapsulated through observer and mutator
functions
Both of them can be used as a domain of a column, of an
attribute of another type, or a table
Comparison/ordering of values: with user-defined methods
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 33
Object-Relational Model: Definition (3)
SQL:2003 supports single inheritance:
A subtype inherits attributes and methods from its
supertype
A subtype can
Define additional attributes and methods
Overload inherited methods: provide another definition with
a different signature
Override inherited methods: provide another
implementation with a the same signature
Final types: cannot have subtypes
Final methods: cannot be redefined
Instantiable type: includes constructor methods for
creating instances
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 34
Object-Relational Model: Definition (4)
Two types of tables
Relational tables: usual ones, where domains for
attributes are all predefined or user-defined types
Typed tables: defined on structured user-defined
types
Constraints (e.g. keys) defined in the table declaration
Have a self-referencing column that contains identifiers
(surrogates) generated by the DBMS
Row identifiers can be used for establishing links between
tables
Ref type: values are the unique identifiers
Always associated with a specified structured type
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 35
Object-Relational Model: University
Example
Academic staff Type Participates Type Project Type
Employee no Employee Project id
Name Project Project acronym
First name Start date Project name
Last name End date Description (0,1)
Address Start date
Email End date
Homepage Budget
Research areas (1,n) Funding agency
Participates (0,n) Participates (1,n)
Assistant Type Professor Type Section Type Course Type
Thesis title Status Semester Course id
Thesis description Tenure date (0,1) Year Course name
Advisor No courses Homepage (0,1) Level
Advises (0,n) Professor Sections (1,n)
Teaches (0,n) Course Has prereq (0,n)
Is a prereq (0,n)
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 36
Translating from ER to OR Schemas (1)
Rule 1: A strong entity type is associated with a
structured type containing all its attributes
Multivalued attributes defined with the array or the
multiset type
Composite attributes defined with the row or the
structured type
A table of this structured type must be declared
Table defines not null constraints for mandatory attributes
Identifier of the entity type define Academic
Academic staff
the primary
staff Type
key of the
table
Employee no Employee no
Name Name
First name First name
Last name Last name
Address Address
Email Email
Homepage Homepage
Research areas (1,n) Research areas (1,n)
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 37
Translating from ER to OR Schemas (2)
Rule 2: A weak entity type is transformed as a strong
entity type, but the structured type contains the
identifier of its owner
Section CouSec Course Section Type Course Type
(1,1) (1,n)
Semester Course id Semester Course id
Year ... Year ...
Homepage (0,1) Homepage (0,1)
Course
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 38
Translating from ER to OR Schemas (3)
Rule 3: A regular relationship type is associated with
structured type containing the identifiers of all
participating entity types, and all attributes of the
relationship
A table of this structured type must be declared
Table defines not null constraints for mandatory attributes
Set of identifiers of entity types define the primary key of
the table
Inverse references may also be defined
Academic staff
Type
Academic staff Project Participates
Participates Type …
(0,n) (1,n) Participates (0,n)
Employee no Start date Project id Acad. staff
... End date ... Project Project
Start date Type
End date
…
Participates (1,n)
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 39
Translating from ER to OR Schemas (4)
Rule 4: Only single inheritance is supported by the
model
multiple inheritance must be removed by
transforming some generalization links into binary
one-to-one relationships, to which Rule 3 is applied
Assistant Student Assistant Student
Assistant Type Major Assistant Type Major
... ... ... ...
PhD Student
PhD Student PhD Student
Thesis title Thesis title
... …
Student
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 40
Defining Object-Relational Schemas in
SQL
create type AcademicStaffType as (
EmployeeNo integer,
Name row (
FirstName character varying (30),
LastName character varying (30) ),
Address character varying (50),
Email character varying (30),
Homepage character varying (64),
ResearchAreas character varying (30) multiset,
Participates ref(ParticipatesType) multiset scope Participates
references are checked )
ref is system generated;
create table AcademicStaff of AcademicStaffType (
constraint AcademicStaffPK primary key (EmployeeNo),
Email with options constraint AcademicStaffEmailNN Email not null,
ref is oid system generated );
…
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 41
Physical Database Design
Specifies how database records are stored,
accessed, and related in order to ensure adequate
performance of a database application
Requires to know the specificities of an application,
properties of data, and usage patterns
Involves analyzing the transactions/queries that run
frequently, that are critical, and the periods of time
in which there will be a high demand on the
database (peak load)
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 42
Performance of Database Applications
Factors for measuring performance of database
applications
Transaction throughput: # transaction processed in a time
interval
Response time: Elapsed time for the completion of a transaction
Disk storage: Space required to store the database file
A compromise has to be made among these factors
Space-time trade-off: Reduce time to perform an operation by
using more space, and vice versa
Query-update trade-off: Access to data can be more efficient by
imposing some structure upon it
However the more elaborate the structure, the more time is
needed to built it and maintain it when content changes
After initial physical design is done, it is necessary to
monitor it and tune it
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 43
Data Organization
A database is organized in secondary storage into files,
each composed of records, at their turn composed of
fields
In a disk, data is stored in disk blocks (pages)
Set by the operating system during disk formatting
Transfer of data between main memory and disk takes
place in units of disk blocks
DBMSs store data on database blocks (pages)
Selection of a database block depends on several issues
Locking granularity may be at the block level, not the record
level
For disk efficiency, database block size must be equal to or be a
multiple of the disk block size
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 44
File Organization
Arrangement of data in a file into records and blocks
Heap files (unordered files): Records placed in the
file in the order as they are inserted
Efficient insertions, slow retrieval
Sequential files (ordered files): Records sorted on the
values of ordering fields
Fast retrieval, slow inserting and deleting
Hash files: Use a hash function that calculates the
block (bucket) of a record based on one or several
fields
Collision: Bucket is filled and a new record must be inserted
Fastest retrieval, but collision management degrades
performance
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 45
Indexes
Additional access structures that speed up retrieval
of records in response to search conditions
Provide alternative ways of accessing the records
based on the indexing fields on which the index is
constructed
Many different types of indexes
Clustering vs nonclustering: Whether records in the file are
physically ordered according to the fields of the index
Single column vs multiple column, depending on the number
of indexing files
Unique vs nonunique: Whether duplicate values are allowed
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 46
Indexes, Clustering
Types of indexes (cont.)
Sparse or dense: Whether there are index records for all
search values
Single-level vs multilevel: Whether an index is split in
several smaller indexes, with an index to these indexes
Multi-level indexes often implemented by using B-trees of B+-
trees
Clustering: Tables physically stored together as they
share common columns (cluster key) and are often
used together
Improves data retrieval
Cluster key stored only once storage efficiency
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 47
Outline
Database Concepts
Steps in Database Design
Entity-Relationship Model
Logical Database Design
Relational Model
Object-Relational Model
Physical Database Design
Data Warehouse Concepts
Logical Data Warehouse Design
Physical Data Warehouse Design
Data Warehouse Architecture
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 48
Data Warehouses: Definition (1)
Operational databases (online transaction processing
systems or OLTP), are not suitable for data analysis
Contain detailed data, do not include historical data, perform
poorly for complex queries due to normalization
Data warehouses address requirements of decision-
making users
A data warehouse is a collection of subject-oriented,
integrated, nonvolatile, and time-varying data to
support data management decisions
Subject-oriented: focus on particular subjects of
analysis (e.g., purchase, inventory, and sales in a retail
company)
Operational databases focus on specific functions of an
application
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 49
Data Warehouses: Definition (2)
Integrated: Join together data from several
operational and external systems problems due to
differences in data definition and content
In operational DBs these problems are solved in the design
phase
Nonvolatile: Data modification and removal is
disallowed scope of the data is long period of time
Operational DBs keep data for a short period of time, as
required for daily operations
Time-varying: Keep different values for the same
informa-tion and the time when changes to these
values occurred
Operational DBs typically do not have temporal support
Online analytical processing (OLAP): Allows decision-
making users to perform interactive analysis of data
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 50
Multidimensional Model
DWs and OLAP use a multidimensional view of data
Represented as a data cube or an hypercube
Dimensions: Perspectives for analyzing data
Cells (facts): Contain measures, values that are to be
analyzed
Milan 24 18 28 14
Rome 33 25 23 25
Nice 12 20 24 33
Paris
Q1 21 10 18 35 measure
Time (Quarter)
values
Q2 27 14 11 30
dimensions
Q3 26 12 35 32
Q4 14 20 47 31
games DVDs
books CDs
Product (Category)
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 51
Hierarchies
Data granularity: Level of detail of measures
Data analyzed at different granularities (abstraction
levels)
Hierarchies relate low-level (detailed) concepts to
higher-level (general concepts)
Example: Store – City – Region/Province – Country
Given two related levels in a hierarchy, lower level is
called child, higher level is called parent
Instances of these levels are called members
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 52
Kinds of Hierarchies
Balanced: Same number of levels from leaf members to
root
Store dimension All
Country level France ... Italy
Region/ Ile-de-France Provence-Alpes- Lazio Lombardy
Province level Côte d'Azur ... ...
City level Paris Nice Rome Milan
Store level Store 1 Store 2 Store 3 ... Store 10 Store 11 Store 12
Noncovering (ragged): Parent of some members does not
belong to the level immediately above
E.g., small countries do not have Region/Province level
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 53
Kinds of Hierarchies
Unbalanced: May not have instances at the lower
levels
E.g., some cities do not have stores, sales are through
wholesalers
Parent-child (recursive): Involve
Lepage twice the same level
Ivanov Claus
Vancamp Rochat Pirlot Quintin
Jottard Nguyen Weiss Spiegel Praet
Hallez Berger
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 54
Measure Aggregation and
Summarizability
Measures are aggregated when using hierarchies for
visualizing data at different abstraction levels
Summarizability conditions ensure correct
aggregation
Disjointness of instances: Grouping of instances in a level
with respect to the parent in the next level must result in
disjoint sets
Completeness: All instances are included in the hierarchy
and each instance is related to one parent in the next level
Correct use of aggregation functions: Type of measures
determine the kind of aggregation functions that can be
applied
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 55
Measure Classification: Additivity
Additive measures (flow or rate measures): Can be
meaningfully summarized using addition along all
dimensions
E.g., sales amount can be summarized when the hierarchies
in Store, Time, and Product dimensions are traversed
Semiadditive measures (stock or level measures):
Can be meaningfully summarized using addition
along some (not all) dimensions
E.g., inventory quantities, can be aggregated in the Store
dimension, but cannot be aggregated in the Time dimension
Nonadditive measures (value-per-unit measures):
Cannot be meaningfully summarized using addition
along any dimension
E.g., item price, cost per unit, exchange rate
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 56
Measure Classification: Aggregation
Complexity
Distributive measures: Defined by an aggregation
function that can be computed in a distributed way
Data is partitioned into n sets, aggregate function applied to
each set, aggregated value is computed by applying a
function to these n subaggregate values
E.g., count, sum, min, max
Algebraic measures: Defined by an aggregation
function that has a bounded number of arguments,
each is obtained by applying a distributive function
E.g., average, computed from sum and count, which are
distributive
Holistic measures: Cannot be computed from other
subaggregate values
E.g., median, mode, rank
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 57
OLAP Operations: Roll up
Transforms detailed measures into summarized ones
when one moves up in a hierarchy
Milan 24 18 28 14
Rome 33 25 23 25 Italy 57 43 51 39
Nice 12 20 24 33 France
Paris Q1 33 30 42 68
Time (Quarter)
Q1 21 10 18 35 Roll-up to the Country level
Time (Quarter)
Q2 27 14 11 30
Q2 27 14 11 30
Q3 26 12 35 32
Q3 26 12 35 32
Q4 14 20 47 31
Q4 14 20 47 31
games DVDs
games DVDs books CDs
books CDs Product (Category)
Product (Category)
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 58
OLAP Operations: Drill down
Opposite to the roll-up operation, i.e., it moves from
a more general level to a detailed level in a
hierarchy
Milan 24 18 28 14 Milan 8 6 9 5
Rome 33 25 23 25 Rome 10 8 11 8
Nice Nice 4 7 8 10
12 20 24 33
Paris Paris
Q1 21 10 18 35 Drill-down to Jan 7 2 6 13
Time (Quarter)
the
Time (Quarter)
Q2 27 14 11 30 Feb 8 4 8 12
Month level
Q3 26 12 35 32 Mar 6 4 4 10
Q4 14 20 47 31 ... ... ... ... ...
games DVDs Dec 4 4 16 7
books CDs games DVDs
Product (Category) books CDs
Product (Category)
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 59
OLAP Operations: Pivot or Rotate
Rotates the axes of a cube to provide an alternative
presentation of the data
Milan 24 18 28 14
Rome 33 25 23 25 DVDs 35 30 32 31
Nice 12 20 24 33 CDs 18 11 35 47
games 10 14 12 20
Paris
books
Q1 21 10 18 35
Time (Quarter)
Paris 21 27 26 14
Store (City)
Q2 27 14 11 30
Pivot Nice 12 14 11 13
Q3 26 12 35 32
Rome 33 28 35 32
Q4 14 20 47 31
Milan 24 23 25 18
games DVDs
Q1 Q2 Q3 Q4
books CDs
Product (Category) Time (Quarter)
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 60
OLAP Operations: Slice
Performs a selection on one dimension of a cube,
resulting in a subcube
Milan 24 18 28 14
Rome 33 25 23 25 Q1 21 10 18 35
Time (Quarter)
Nice 12 20 24 33
Paris Q2 27 14 11 30
Q1 21 10 18 35
Time (Quarter)
Q3 26 12 35 32
Q2 27 14 11 30 Slice on Store.City = ‘Paris’
Q4 14 20 47 31
Q3 26 12 35 32 games DVDs
Q4 14 20 47 31 books CDs
Product (Category)
games DVDs
books CDs
Product (Category)
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 61
OLAP Operations: Dice
Defines a selection on two or more dimensions, thus
again defining a subcube
Milan 24 18 28 14
Rome 33 25 23 25
Nice 12 20 24 33 Nice 12 20 24 33
Paris Paris
(Quarter)
Q1 21 10 18 35 Q1 21 10 18 35
Dice on Store.Country =
Time (Quarter)
Time
Q2 27 14 11 30 ‘France’ and Time.Quarter=
Q2 27 14 11 30
‘Q1’ or ‘Q2’
Q3 26 12 35 32 games DVDs
books CDs
Q4 14 20 47 31
Product (Category)
games DVDs
books CDs
Product (Category)
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 62
Other OLAP Operations
Drill-across: Executes queries involving more than
one cube, which share at least one dimension
E.g., compare sales data in a cube with purchasing data in
another one
Drill-through: Allows one to move from data at the
bottom level in a cube, to data in the operational
systems from which the cube was derived
E.g., for determining the reason for outlier values in a data
cube
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 63
Implementations of a Multidimensional
Model
Relational OLAP (ROLAP): Store data in relational
databases, support extensions to SQL and special
access methods to efficiently implement the model
and its operations
Multidimensional OLAP (MOLAP): Store data in
special data structures (e.g., arrays) and implement
OLAP operations in these structures
Better performance than ROLAP for query and aggregation,
less storage capacity than ROLAP
Hybrid OLAP (HOLAP): Combine both technologies,
E.g., detailed data stored in relational databases,
aggregations kept in a separate MOLAP store
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 64
Logical Data Warehouse Design
In ROLAP systems, tables organized in specialized
structures
Star schema: One fact table and a set of dimension tables
Referential integrity constraints between fact table and
dimension tables
Dimension tables may contain redundancy in the presence of
hierarchies
Snowflake schema: Avoids redundancy of star schemas
by normalizing dimension tables
Normalized tables optimize storage space, but decrease
performance
Starflake schema: Combination of the star and snowflake
schemas, some dimensions normalized, other not
Constellation schema: Multiple fact tables that share
dimension tables
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 65
Logical DW Design: Star Schema
Product Store
Product key Store key
Product number Store number
Product name Sales Store name
Description Store address
Size Product fkey Manager name
Category name Store fkey City name
Category descr. Promotion fkey City population
Department name Time fkey City area
Department descr. Amount State name
... Quantity State population
State area
State major activity
Promotion Time ...
Promotion key Time key
Promotion descr. Date
Discount pct. Event
Type Weekday flag
Start date Weekend flag
End date Season
... ...
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 66
Logical DW Design: Snowflake Schemas
Product Category Department
Product key Category key Department key
Product number Category name Department name
Product name Description Description
Description Department fkey ...
Size ...
Category fkey
...
Promotion Sales Time
Promotion key Product fkey Time key
Promotion descr. Store fkey Date
Discount pct. Promotion fkey Event
Type Time fkey Weekday flag
Start date Amount Weekend flag
End date Quantity Season
... ...
Store
City State
Store key
Store number City key State key
Store name City name State name
Store address City population State population
Manager name City area State area
City fkey State fkey State major activity
... ... ...
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 67
Logical DW Design: Constellation
Schemas
Promotion Sales Store
Promotion key Product fkey Store key
Promotion descr. Store fkey Store number
Discount pct. Promotion fkey Store name
Type Time fkey Store address
Start date Amount Manager name
End date Quantity City name
... City population
City area
State name
Product Time State population
State area
Product key Time key State major activity
Product number Date ...
Product name Event
Description Weekday flag
Size Weekend flag
Category name Season
Category descr. ...
Department name
Department descr.
...
Purchases Supplier
Product fkey Supplier key
Supplier fkey Supplier name
Order time fkey Contact person
Due time fkey Supplier address
Amount City name
Quantity State name
Freight cost ...
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 68
Physical Data Warehouse Design: Views
(1)
Data warehouses are several orders of magnitude larger
than operational database
Physical design is crucial for ensuring adequate response
time for complex queries
View: Table derived from a query
Query may involve base tables or other views
Materialized view : View physically stored in a database
Enhances query performance by precalculating costly operations
This is achieved at the expense of storage space
Updating materialized views: Modifications of underlying
base tables must be propagated into the view
Must be performed incrementally to ensure efficiency
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 69
Physical Data Warehouse Design: Views
(2)
In a DW not all possible aggregations can be
materialized
Number of aggregations grows exponentially with the number of
dimensions and hierarchies
Selection of materialized views: Identify the queries to
be prioritized, which determines the views to be
materialized
DBMSs are providing tools that tune the selection of
materialized views given previous queries
Queries must be rewritten in order to best exploit
existing materialized views to improve response time
Drawback of materialized view approach: Requires one
to anticipate queries to be materialized
DW queries are often ad hoc and cannot always be anticipated
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 70
Physical Data Warehouse Design: Indexes
(1)
Traditional indexes are not appropriate for data
warehouses
Designed for OLTP transactions that access a small number
of tuples
Bitmap indexes: For columns with a low number of
distinct values
Each value is associated with a bit vector whose length is
the number of records in the table
ith bit is Professor
set if ith row has that value in the column
Index on Rank
Small
PId in size, logical
Name ... Rank operators
Rec# used toAssociate
Assistant manipulate
Full
them
P1 John Smith ... Assistant 1 1 0 0
P2 Ada Jones ... Associate 2 0 1 0
P3 Bill Kock ... Full 3 0 0 1
P4 Ed Low ... Associate 4 0 1 0
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 71
Physical Data Warehouse Design: Indexes
(2)
Join indexes: Materialize relational join by keeping
pairs of row identifiers that participate in the join
Index on Sales Index on
Product Client
P1,C1,...
P1 C1
P1,C2,...
P2 C2
P2,C1,...
P3 ...
P3,C2,...
...
...
Index selection: Select indexes to minimize
execution cost of a series of queries and updates
DBMSs are providing tools to assist DBAs to this task
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 72
Fragmentation or Partitioning
Divide the contents of a relation into several files
that can be more efficiently processed
Vertical fragmentation: Splits attribute of a table into
groups that are stored independently
E.g., most often used attributes in one partition
Horizontal fragmentation: Divides records of a table
into groups according to a particular criterion
Range partitioning: Partitioned wrt range of values
E.g., partitioning based on time, each partition contains data
about a particular time period (year, month)
Hash partitioning: Partitioned wrt hash function
Combined partitioning: Combine these two methods
E.g., rows first partitioned according to range values, then
subdivided using a hash function
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 73
Data Warehouse Architecture (1)
Internal Data staging Metadata
sources
OLAP tools
Enterprise Reporting
ETL OLAP tools
Operational data
process warehouse server
databases
Statistical
tools
External
sources Data marts
Data mining
tools
Data Back-end Data warehouse OLAP Front-end
sources tier tier tier tier
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 74
Data Warehouse Architecture (2)
Data sources
Operational databases
Other internal or external sources of information (e.g. files)
Back-end tier
Extraction-Transformation-Loading (ETL) tools for
manipulating data from sources
Data staging area: Intermediate database where
manipulation is done
OLAP tier
OLAP Server: Supports multidimensional data and operations
Front-end tier: Deals with data analysis and
visualization
Composed of OLAP tools, reporting tools, statistical tools,
data-mining tools, …
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 75
Back-End Tier
Extraction: Gathers data from multiple heterogeneous
data sources
May be operational databases or files in various formats
May be internal or external to the organization
Uses APIs such as ODBC, JDBC, … for achieving interoperability
Transformation: Modifies data to conform to the data
warehouse format
Cleaning: Removes errors, inconsistencies, format transformation
Integration: Reconciles data from different sources
Aggregation: Summarizes data according to the granularity (level
of detail) of the DW
Loading: Feeds the DW with transformed data
Also includes refreshing the data warehouse at a specified
frequency
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 76
Data Warehouse Tier
Enterprise data warehouse: Centralized DW that
encompasses all areas in an organization
Data mart: Specialized DW targeted to a particular
functional area or user group
Their data can be derived from the enterprise DW or collected
from data sources
Metadata repository: Describes the content of the DW
Business metadata: Meaning (semantics) of data, organization
rules, policies, constraints, …
Technical metadata: How data is structured/stored in the
computer
Data sources, data warehouse, and data marts: logical and physical
schemas, security information, monitoring information …
ETL process: Data lineage (trace to sources), rules, defaults, refresh
and purging rules, algorithms for summarization, …
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 77
OLAP Tier
OLAP servers that provides multidimensional view
from DWs and data marts
Can be ROLAP, MOLAP, or HOLAP
Most database products provide OLAP extensions
and related tools for manipulating cubes
However, no standardized language for querying
data cubes
Oracle uses Java and query language OLAP DML
SQL Server uses .NET and query language MDX
XMLA (XML for Analysis) aims at providing a common
language for exchanging multidimensional data
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 78
Front-End Tier
OLAP tools: Allow interactive exploration and
manipulation of the warehouse data
Facilitate formulation of ad hoc queries (no prior knowledge
of them)
Reporting tools: Enable production, delivery and
management of reports (paper and web-based)
Use predefined queries
Statistical tools: Used to analyze and visualize the
cube data using statistical methods
Data-mining tools: Allow users to analyze data to
discover patterns, trends, enable predictions
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 79
Variations of the Architecture
Enterprise DW without data marts, or data marts without
enterprise DW
Building enterprise DW is costly, building a data mart is easier
Integrating independently created data marts is costly
No OLAP server: Client tools access directly DW
Extreme situation: No DW either, client tools access
sources
Virtual DW: Set of materialized views over data sources
Easier to build, but does not provide a real DW solution
Does not contain historical data, no centralized metadata, no
possibility to clean and transform the data
No data staging area: If the source systems conform
very closely to the data in the DW
Rarely the case in real-world situations
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 80